Agent MQI

Agent Quality · Self-Improving Agents · Memory

Composite quality score that detects when your AI coding assistant is degrading. 24 metrics across thinking, research, execution, trust, throughput, and environment.

24 metrics, 2,740 sessions, 6 behavioral dimensions

Agent MQI dashboard overview

56.4% of My Words Are Negative

That number came out of running sentiment analysis across 2,740 Claude Code sessions. More than half of everything I typed to my AI assistant was frustration, correction, or profanity. Not because the tool is bad. Because there was no signal telling me when it was degrading.

The quality of an AI coding session is not binary. It does not go from "working" to "broken." It drifts. Reasoning depth drops first. Then the agent starts editing files it has not read. Then it stops explaining what it is doing. By the time you notice the output is wrong, you have already lost an hour.

Agent MQI is a composite quality score built to catch that drift before it costs you time. It parses Claude Code session transcripts and produces a single number: MQI, the Model Quality Index.

Sentiment trend over 2,740 sessions showing negative word frequency

Keyword sentiment trend across sessions. The dip in late-session averages correlates with increased re-instruction rate and user interrupt count.

The Problem Nobody's Measuring

Teams measure output quality. Code passes tests or it does not. PRs ship or they do not. What nobody measures is process quality: how well the agent reasoned, whether it read files before editing them, whether it understood the task or just performed the nearest completion.

Process quality is a leading indicator. Output quality is a lagging one. By the time a bad output shows up, the process failure that caused it happened three steps earlier. You can fix the output, but you cannot get back the time spent shepherding an agent through reasoning it should have done on its own.

MQI measures process quality in real time. It reads the session transcript as work happens and emits a score that reflects not just what was produced, but how it was produced.

Swearing at Your AI Is a Feature, Not a Bug

The keyword sentiment metric tracks every word you type with an emotional valence. Profanity, correction phrases, and explicit frustration markers are treated as leading indicators of agent degradation, not noise to be filtered out. Two real examples from the session corpus:

Example 1

"so fuckin do it lol wtf are you waiting for"

Situation: agent stalled, explaining plan instead of executing

Metrics triggered: keyword_sentiment, user_interrupts, zero_reasoning_turn_rate

Example 2

"stop being lazy check the console, dom, and screenshots..."

Situation: agent skipped research phase entirely

Metrics triggered: edits_without_read, keyword_sentiment, re_instruction_rate

Each of these events is now a data point. Across 2,740 sessions, patterns emerge that correlate strongly with session outcome: sessions with high keyword sentiment negativity in the first third are 3.2x more likely to require full restarts.

The 24-Metric Framework

MQI is composed of 24 metrics grouped into six behavioral dimensions. Each dimension gets a weighted contribution to the final score. The weights were calibrated against a held-out corpus of sessions with known outcomes: sessions marked "successful delivery" vs "abandoned" vs "rework required."

Group Metrics Weight
Thinking thinking_depth, reasoning_loops, zero_reasoning_turn_rate, redaction_rate 19%
Research read_edit_ratio, research_mutation_ratio, simplest_fix 16%
Execution edits_without_read, write_edit_ratio, premature_stopping, repeated_edits, stop_hook_violations, reversion_rate, post_compaction_drift, human_time_estimation, trial_and_error_debugging 23%
Trust user_interrupts, self_admitted_failures, keyword_sentiment, re_instruction_rate, implicit_constraint_violator 18%
Throughput token_rate_per_minute 5%
Environment incident_exposure, issue_velocity 19%
Radar chart showing six MQI dimensions and their relative weights

Radar view of the six dimensions for a representative session. Execution and Environment dominate the score; Throughput is a minor signal but correlates with tool stalls.

Per-Model Baselines

MQI is not a universal scale. A score of 35 means something different for Opus than for Haiku. Each model has a baseline derived from its own session history. Degradation is measured as deviation from that baseline, not as an absolute threshold.

Model Sessions Avg MQI
Sonnet 4.6 100 35.8
Haiku 4.5 52 28.9
Opus 4.6 247 12.0
Opus 4.7 146 10.6
MQI score distribution broken down by model

MQI distribution by model across all sessions.

Model degradation curves showing MQI drift over session length

Degradation curves: MQI vs session length per model.

Opus scores low in absolute terms because it takes on harder tasks: longer sessions, more file reads, more reasoning loops. The per-model baseline corrects for this. What matters is whether a session is below its model's baseline, not whether it is below 35.

The Dashboard

The dashboard is a static HTML file served over a local Python server. It reads mqi.json from disk and renders six panels. Each panel targets a specific diagnostic question.

Why MQI Moved

Breakdown showing which metric groups contributed to MQI change

Contribution breakdown per session. When MQI drops, this panel shows which dimension pulled it down: Execution penalties, Trust signals, or Environment pressure.

MQI Trend

Rolling MQI trend over recent sessions

Rolling 20-session average with per-session scatter. The trend line is the signal; scatter is the noise. Sustained drops below baseline trigger a degradation alert.

External Signals

External signal correlations: API latency, compaction events, tool stalls

API latency, compaction events, and tool stall counts plotted alongside MQI. These environmental factors do not cause poor reasoning, but they correlate with it.

Keyword Sentiment

Keyword sentiment timeline showing positive and negative word frequency

Positive and negative word frequency over session time. Negative spikes in the first third of a session are the strongest early predictor of a low final MQI score.

Hourly and Daily Thinking Patterns

Thinking depth by hour of day across all sessions

Thinking depth by hour of day.

Thinking depth by day of week across all sessions

Thinking depth by day of week.

Thinking depth is not uniform across time. Sessions started in the early morning show consistently higher reasoning loop counts. This is not a model property: it reflects the complexity of tasks I start at different times of day.

CC Version Correlation

MQI scores grouped by Claude Code version

MQI grouped by Claude Code CLI version. Version upgrades occasionally shift baseline behavior; this panel catches regressions before they affect output quality.

What 2,740 Sessions Reveal

  • Sessions where the agent edits a file without first reading it are 2.8x more likely to require reverting the change within the same session.
  • Zero-reasoning turns cluster at session boundaries: immediately after compaction events and in the first turn after a long user pause. Both are observable signals.
  • The single strongest predictor of session abandonment is not a low MQI score. It is a high re-instruction rate combined with a low self-admitted failure rate. The agent is getting corrected but not acknowledging it.
  • Haiku sessions have a lower absolute MQI but a smaller standard deviation. Haiku is consistently mediocre; Opus is occasionally excellent and frequently poor. For predictable output quality, Haiku wins.

Next: Geographic Routing for Off-Peak Inference

The hourly thinking data points to an untested hypothesis: if reasoning quality correlates with time of day, it may also correlate with which inference region handles the request. Peak-hour US traffic routes to overloaded US clusters. Off-peak requests hit less contested infrastructure.

The next iteration of MQI will track inferred routing alongside session quality. If the signal holds, the system can recommend request timing or region hints to recover thinking depth without changing models.

Try It

Agent MQI is a Rust binary that parses Claude Code session transcripts and writes a JSON summary to disk. The dashboard is a static HTML file with no server requirement.

git clone https://github.com/Alex-Zeo/agent-mqi.git
cd agent-mqi
cargo build --release
./target/release/mqi -o dashboard/data/mqi.json
cd dashboard && python3 -m http.server 8080

The binary scans ~/.claude/projects/ for session transcripts automatically. No configuration required for a default Claude Code installation.

Self-Improving Pattern

Self-monitoring quality: per-model baselines auto-calibrate, keyword sentiment tracks frustration as a leading indicator, real-time hooks warn when thinking depth drops

Read the thesis →