Skill

dashboard-qa-loop

April 21, 2026

qadashboardinteraction-testingralph-loop

Trigger

User wants to QA a running web dashboard: find issues, interaction-test each element, and run fix-verify loops until the dashboard is clean

Version: 260423

Changelog

260423: Hover content standards

D4: added content-standard rules: no bare numbers, coverage requirement, no em dashes
D4: added tooltipContentOk() helper for em-dash and bare-number detection
Version bump to 260423

260422b: Major upgrade (self-improving + design quality + DB mutation + viewport)

Added Phase 0b: Viewport & Responsive Survey (4 breakpoints, overflow checks)
Added Phase 1d: Design Quality Lens (visual quality, typography, hover, dual-axis, colorblind)
Added Phase 2b: DB/Cache Mutation Verification (button/form → DB diff protocol)
Added Phase 4: Self-Improving Loop (GEPA skill crystallization, corpus-aware, ~/.rusty-data/dashboard-qa-history.jsonl)
Added run_record output, design_issues output
Added browser_resize + browser_hover to mounted_allowed_tools
Design quality lens draws from topics/wiki-image-eval-rubric 12-dimension rubric
Self-improving pattern uses Pattern 2 (Skill Crystallization) + Pattern 3 (Metric Ratchet) from skills/self-improving-agent-patterns
Expanded weaknesses with design + self-improvement blind spots

260422a: Initial creation

Codifies the dashboard QA workflow developed during MQI v3 scoring audit (rusty-bloomnet)
Ralph loop adapted from definitions/ralph-loop for issue-driven QA iteration
Incorporates DS/DE/AI-eng weakness taxonomy from session 2026-04-22

A systematic workflow for auditing a running web dashboard across four dimensions: data correctness, API contract, scoring logic, and design quality. Discovers all interactive elements, tests responsiveness at multiple viewport sizes, catalogs issues with severity, then runs a ralph loop around each: reproduce → hypothesize → fix → interaction-test → visual-check → close or iterate.

At the end of every run, the self-improving loop appends a run record to a persistent history file and proposes skill updates based on the entire corpus of past runs: not just the current session.

Core insight: Most dashboard bugs are context coupling failures (rendered value correct in isolation, wrong given surrounding state) or silent DB failures (button click returns 200 but the DB state did not change). Both classes are invisible to single-element tests; only cross-component interaction + DB diff tests expose them.

Interface

Trigger: “QA the dashboard”, “run interaction tests”, “find issues in the UI”, “ralph loops”, or after any schema/API/scoring/frontend change.

Inputs:

app_url: running app URL (e.g. http://localhost:3001)
api_prefix: API base path (e.g. /api)
element_map: optional list of panels/routes to audit; if absent, do full traversal

Outputs:

issue_catalog: structured list of all found issues with severity + hypothesis
closed_issues: issues fixed and verified
open_issues: remaining after max iterations
weakness_log: categories of bugs the test stack cannot reliably catch
design_issues: subset of catalog: typography, spacing, hover, viewport, chart quality violations
run_record: JSON entry appended to ~/.rusty-data/dashboard-qa-history.jsonl

Phase 0: Element Discovery

Build the full element map before cataloging issues. Navigate every route and enumerate:

Layer	What to capture
Navigation	All tabs, routes, breadcrumb states
Controls	Pickers, date ranges, toggles, dropdowns
Charts	Chart type, axes, legend items, tooltip behavior
Panels	Panel headers, metric labels, badge values
Forms	Input fields, submit buttons, validation states
API endpoints	Every `/api/*` call triggered by each view
Console	JS errors and warnings emitted during navigation

Scope rule: If element_map is provided, skip to Phase 0b. If absent, enumerate exhaustively: every interactive element is a potential issue surface.

// Enumerate all interactive elements on a page
[...document.querySelectorAll('button, input, select, [role="tab"], [data-picker]')]
  .map(el => ({ tag: el.tagName, text: el.textContent.trim().slice(0, 40), id: el.id }))

Phase 0b: Viewport & Responsive Survey

Strength: Catches responsive bugs invisible at a single viewport. Most dashboards are built on a 1440px monitor and never tested at 375px.
Weakness: Multiple viewport tests multiply runtime; some bugs only appear at specific OS/browser combinations (font rendering, scrollbar widths).

Test at four breakpoints using browser_resize. At each breakpoint, take a screenshot and run the overflow check.

Breakpoint	Width	Represents
Mobile	375px	iPhone SE / smallest common
Tablet	768px	iPad portrait
Desktop	1280px	Laptop (common dev size)
Wide	1440px	Standard monitor baseline

Checks at each breakpoint:

// 1. Page-level horizontal overflow (no sideways scroll)
document.body.scrollWidth > window.innerWidth
  // TRUE = overflow exists → file MEDIUM issue

// 2. Chart overflow: all charts contained within their parent
Array.from(document.querySelectorAll('canvas, svg, .chart')).map(el => {
  const rect = el.getBoundingClientRect();
  const parent = el.parentElement.getBoundingClientRect();
  return { el: el.className, overflow: rect.right > parent.right + 4 };
})

// 3. Navigation accessible: key nav elements visible and not clipped
Array.from(document.querySelectorAll('nav a, [role="tab"]')).map(el => {
  const rect = el.getBoundingClientRect();
  return { text: el.textContent.trim(), visible: rect.width > 0 && rect.height > 0 };
})

// 4. Touch target size: interactive elements ≥ 44px × 44px (WCAG 2.5.5)
Array.from(document.querySelectorAll('button, a, [role="tab"], input')).map(el => {
  const rect = el.getBoundingClientRect();
  return { text: el.textContent.trim().slice(0,30), w: rect.width, h: rect.height,
           fail: rect.width < 44 || rect.height < 44 };
}).filter(x => x.fail)

Issue severity by viewport:

Overflow at 375px → HIGH (affects real mobile users)
Overflow at 768px → MEDIUM
Overflow at 1280px+ → LOW (dev machine size; unlikely in production)
Touch target < 44px → MEDIUM at any size

Phase 1: Issue Catalog

For each element, run a structured inspection. Output a catalog entry for every anomaly:

Issue #{id}
  severity:    [CRITICAL | HIGH | MEDIUM | LOW]
  category:    [data | api | scoring | design | db | responsive]
  element:     <panel or component name>
  symptom:     <what looks wrong>
  hypothesis:  <root cause: data, compute, display, coupling, or design?>
  api_check:   <which /api/* endpoint to verify first>
  viewport:    <at which breakpoint seen; "all" if always present>
  status:      OPEN

Severity classification:

Severity	Meaning
CRITICAL	Wrong data shown; user makes incorrect decisions
HIGH	Correct data, wrong context / responsive break at common size
MEDIUM	Visual/UX inconsistency; design violation; small viewport break
LOW	Cosmetic, aspirational, or rare-condition improvement

Four-Lens Inspection

Run each element through all four lenses. Each finds bugs the others miss.

DS Lens: Data Accuracy

Goal: rendered numbers match independently computed values from raw API data.

Fetch raw API response for the relevant endpoint.
Manually compute the expected rendered value (e.g., run groupZ by hand).
Extract rendered DOM value via evaluate_script / browser_evaluate.
Diff: rendered_value == expected_computed_value.

Critical invariants:

Weighted-sum: sum(w_i * z_i for all metrics) == compositeZ (5 decimal precision)
groupZ: each group’s radar spoke = sum(z_i * w_i for members) / sum(w_i for members)
Threshold crossings: metric at z=-2.1 renders in Error band, not Warning
Baseline vs current: “base” column values drawn from same session/period as “current”?

Strength: Catches silent math errors that golden-path visual checks completely miss.
Weakness: You must know the correct formula in advance. If the formula itself is wrong, the invariant check passes. You measure what you can specify.

DE Lens: API Contract

Goal: API response shape and freshness match what the frontend expects.

Capture all network requests during navigation (browser_network_requests).
For each /api/* call: check HTTP status, Content-Type, response schema fields.
Verify no stale cache: compare response timestamp against last ingest run timestamp.
Check added/removed fields don’t silently produce undefined in the frontend.

# Schema diff: before and after any backend change
curl -s http://localhost:3001/api/mqi | jq 'keys'
# Compare output across two runs; missing keys = CRITICAL

Critical rule: HTTP 200 != dashboard green. A 200 with a wrong schema silently produces NaN or 0 in the UI. Always click every panel after API/schema changes.

Strength: Catches schema drift before it reaches users.
Weakness: API field presence ≠ API field correctness. A field present with a wrong value (wrong unit, wrong aggregation period) passes the contract test. Combine with DS Lens to verify values, not just shape.

AI Engineer Lens: Scoring Invariants

Goal: scoring system internal consistency.

Verify metric weights sum to 1.0 across all groups.
Verify each metric is a member of exactly one group (no double-counting, no orphans).
Verify baseline source is stable: same session selection → same baseline z-scores.
Check picker ↔ display coupling: changing the session picker actually changes ALL downstream rendered values (radar, table, composite score), not just some.

The A || B coupling bug class: Any latestSessionMqi || currentMqi pattern is a potential coupling failure: A may be correct for one UI state but wrong for another. After every API or schema change, grep:

grep -n '||' frontend/js/model-quality.js | grep -i 'session\|mqi\|current\|latest'

Test the coupling path explicitly:

Select “All (30-day mean)” in session picker → verify radar uses 30-day groupZ, not latest session
Select specific session → verify radar switches to that session’s groupZ
Switch back to “All” → verify radar reverts (no stale state)

Strength: Catches context coupling failures invisible to single-element golden-path tests.
Weakness: Requires knowing the intended coupling semantics. If the design intent is ambiguous, the test cannot distinguish “wrong coupling” from “intentional decoupling.”

Design Quality Lens

Goal: visual design meets professional quality standards across five sub-dimensions. Draws from the 12-dimension image eval rubric at topics/wiki-image-eval-rubric. Apply this lens to every chart, tooltip, table, and interactive control.

Strength: Objectifies visual quality with a weighted scoring rubric; prevents “looks fine to me” from blocking legitimate design bugs.
Weakness: The rubric was designed for static R viz exports, not interactive web components. Animated transitions, hover states, and scroll behavior are not captured. Screenshot evaluation is pixel-level, not semantic: a radar spoke at 30% height looks identical whether the value is +0.30 or -0.30.

D1: Chart Quality (draws from wiki-image-eval-rubric dimensions 1–8)

For every chart, evaluate:

Check	What to verify	Severity if failing
Accuracy & Scale	Y-axis baseline is 0 or clearly annotated; no truncated axes that exaggerate differences	CRITICAL
Clarity of Purpose	Chart title states the insight, not the chart type (“MQI dropped 15%” not “Radar Chart”)	MEDIUM
Chart Type Fit	Type matches data: radar for multi-dimensional comparison, line for temporal trends, bar for category comparison. Pie only for ≤5 slices summing to 100%	HIGH
Labels & Annotations	Axes have labels with units; legend present if >1 series; data points annotated where they tell a story	MEDIUM
Colorblind Safety	No red+green-only encoding; palette passes grayscale test	HIGH
Dual-axis discipline	See D2 below	HIGH
Overplotting	Dense scatterplots use alpha or hexbin; overlapping text avoided	MEDIUM
Visual Hierarchy	Color weight and font size direct attention to primary finding first	LOW

// Extract chart title for clarity check
document.querySelector('.chart-title, .recharts-label, svg title')?.textContent

// Check Y-axis baseline (SVG-based charts)
document.querySelector('.recharts-yAxis .recharts-cartesian-axis-tick-value')?.textContent
// If first tick is not "0" and chart is a bar/line, flag for truncation review

D2: Dual-Axis Charts

Dual-axis charts are among the most frequently misused chart types. Apply these rules whenever a chart has two Y-axes:

Rule	Rationale	Check
Visually distinct line styles	Left and right axes must use different line types (solid vs dashed)	Check SVG stroke-dasharray attributes
Different colors per axis	Each axis series should have its own color from the brand palette	Both series same color = CRITICAL
Legend labels both axes	Legend must identify which series uses which axis	Check legend text vs axis label text
Scale relationship disclosed	If scales are mismatched, a note should explain (e.g., “left: count, right: rate”)	Missing disclosure = HIGH
No scale cherry-picking	Right axis should not be scaled to visually minimize/exaggerate the secondary series	Requires domain review

// Check for dual-axis presence
document.querySelectorAll('.recharts-yAxis').length > 1
// If true, apply full D2 checklist

D3: Typography & Spacing

Check	Standard	How to verify	Severity
Minimum font size	≥ 11px rendered (not just CSS)	`getComputedStyle(el).fontSize`	HIGH
Line height	≥ 1.4 for body text, ≥ 1.2 for compact metric values	`getComputedStyle(el).lineHeight`	MEDIUM
Text truncation	Key metric values must not be truncated with ellipsis	Check `el.scrollWidth > el.clientWidth`	HIGH
Element padding	Clickable elements: min 8px padding on all sides	`getComputedStyle(el).padding`	MEDIUM
Label overlap	Text nodes within 2px of each other = collision	Compare getBoundingClientRect() of adjacent labels	HIGH
Z-index collision	Two positioned layers occupy same screen space	Check computed z-index for overlapping components	MEDIUM

// Font size audit on all metric value elements
Array.from(document.querySelectorAll('.metric-value, .badge, .score, td')).map(el => ({
  text: el.textContent.trim().slice(0, 20),
  fontSize: parseFloat(getComputedStyle(el).fontSize),
  truncated: el.scrollWidth > el.clientWidth
})).filter(x => x.fontSize < 11 || x.truncated)

// Label overlap detection for chart annotations
function hasOverlap(rects) {
  for (let i = 0; i < rects.length; i++)
    for (let j = i+1; j < rects.length; j++)
      if (rects[i].right > rects[j].left - 2 && rects[i].bottom > rects[j].top - 2
          && rects[i].left < rects[j].right && rects[i].top < rects[j].bottom)
        return true;
  return false;
}
const labelRects = Array.from(document.querySelectorAll('.recharts-label, .chart-label'))
  .map(el => el.getBoundingClientRect());
hasOverlap(labelRects) // true = file HIGH issue

For every tooltip, dropdown, and popover: trigger hover/click, then check viewport containment.

Strength: getBoundingClientRect() gives exact pixel positions; viewport overflow is objective.
Weakness: Only catches geometric overflow, not content accuracy inside the tooltip. A tooltip that appears inside the viewport but shows wrong data passes this check.

// After triggering a tooltip, check viewport containment
function tooltipSafe(tooltipEl) {
  const rect = tooltipEl.getBoundingClientRect();
  return {
    topOk:    rect.top >= 0,
    leftOk:   rect.left >= 0,
    bottomOk: rect.bottom <= window.innerHeight,
    rightOk:  rect.right <= window.innerWidth,
    safe: rect.top >= 0 && rect.left >= 0 &&
          rect.bottom <= window.innerHeight &&
          rect.right <= window.innerWidth
  };
}
// Use browser_hover to trigger, then browser_evaluate to check

Required hover test sequence per tooltip:

Hover element → tooltip appears
Run tooltipSafe() → all four bounds must pass
Move mouse away → tooltip disappears (no ghost tooltip)
Hover near right/bottom edge of viewport → tooltip should flip/reposition, not clip

Content standards (checked after geometric pass):

No bare numbers. Every numeric value in a tooltip must carry a label. 2,383 alone is a HIGH issue; Sessions: 2,383 is correct. Check every chart data-point tooltip, metric card hover, and badge popover.
Coverage requirement. Any dashboard element that benefits from a concise explanation must have a hover tooltip - charts, metric cards, badges, tier labels, model abbreviations, status icons. Flag missing tooltips as MEDIUM issues.
No em dashes. Tooltip text must never contain an em dash (-). Use a plain hyphen (-) or colon (:) as separators. Search tooltip text with tooltipEl.textContent.includes('\u2014') and flag any match as HIGH.

// Content-standard checks
function tooltipContentOk(tooltipEl) {
  const text = tooltipEl.textContent || '';
  const hasEmDash = text.includes('\u2014');
  // bare number: a token that is purely numeric with no adjacent label text
  const bareNumber = /(?:^|\s)[\d,]+(?:\s|$)/.test(text) && !/[A-Za-z]/.test(text.replace(/[\d,.\s%$]/g, ''));
  return { hasEmDash, bareNumber, ok: !hasEmDash && !bareNumber };
}

Phase 2: Ralph Loop Per Issue

Once the catalog is populated, run one ralph loop per issue (highest severity first):

ISSUE = pop next OPEN issue (CRITICAL first, then HIGH, MEDIUM, LOW)
loop:
  reproduce     → confirm symptom is still present in current build
  hypothesize   → classify root cause: data? compute? display? coupling? design? db?
  fix           → minimal upstream code change addressing root cause
  interaction-test → golden path + coupling path (see below)
  visual-check  → screenshot + read; confirm rendered value matches intent
  if verified:
    close ISSUE → mark CLOSED, record fix in catalog
  else:
    iterate     → refine hypothesis, try again (max 3 iterations before escalate)

Reproduce Step

Never code a fix without confirming the bug is still live:

// Extract current rendered value
document.querySelector('.mqi-composite-score, .metric-value')?.textContent

If symptom is gone, close as “already fixed.” Do not write a fix.

Fix Discipline

Fix the root cause, not the symptom. A display patch recurs the next data change.
Fix upstream first: backend bug → fix Rust; compute bug → fix JS function; display bug → fix binding.
Minimal change: exactly this issue. Do not refactor surrounding code.
No gate weakening: do not loosen a threshold or remove a validation to make a bug disappear. That is not a fix: it is a measurement demotion.

Interaction Test

Test the golden path and the coupling path for every fix:

Golden path: correct value renders for default state
Coupling path: related control changes (picker, date range, tab) still propagate correctly downstream
Design path (for design issues): verify the element meets the relevant D1–D4 criterion after fix

// Coupling path example (session picker fix):
// 1. Select "All" → radar must use 30-day groupZ
// 2. Select specific session → radar must switch to that session's groupZ
// 3. Switch back to "All" → radar reverts; no stale state

Visual Check

After passing the interaction test, take a screenshot and read it. Confirm:

Labels match computed values
Color bands correct (Error=red, Warning=yellow, Good=green)
Chart shape makes intuitive sense (low-thinking session → short Thinking spoke)
No layout collapse, overlapping text, or clipped elements
Tooltip stays within viewport
At 375px: no horizontal overflow

If the visual check fails despite the interaction test passing → bug is CSS/layout → file new MEDIUM design issue.

Phase 2b: DB/Cache Mutation Verification

Applies to: every button click, form submit, toggle, or picker that is supposed to write to the database.

Strength: Catches the most dangerous silent failure class: a UI action that returns 200 but doesn’t actually persist.
Weakness: API-only verification doesn’t catch DB-layer corruption, referential integrity failures, or race conditions from concurrent writes. For those, inspect the SQLite DB directly after the action.

Protocol per interactive mutation:

BEFORE:
  1. Record current DB state: GET /api/<relevant-endpoint> → save as before_state
  2. Note which fields should change after the action

ACTION:
  3. Perform the UI action (button click, form submit, toggle)
  4. Wait for network idle (all /api/* requests complete)

AFTER:
  5. GET /api/<same-endpoint> → save as after_state
  6. Diff: expected_changed_fields = after_state - before_state
  7. Check: all expected fields changed
  8. Check: no unexpected fields changed (no side effects)

CACHE CHECK:
  9. Hard-reload the page (Ctrl+Shift+R equivalent)
  10. GET /api/<endpoint> again → must still show after_state
     (confirms cache was invalidated, not just bypassed)

DOUBLE-SUBMIT GUARD:
  11. Click the button twice rapidly
  12. GET /api/<endpoint> → must show exactly one change, not two

Common silent DB failures to look for:

Symptom	Likely cause	Where to check
UI updates but page reload reverts	Browser state updated, DB write failed silently	Check API response body for error message
First click works, second doesn’t	Missing idempotency key / unique constraint	Check network response on second click
Form submit succeeds but field blank in DB	Frontend sends correct payload, backend ignores field	Check request payload vs DB schema
Cache serves stale data after write	Cache not invalidated on write path	Check Cache-Control / ETag headers; check SQLite WAL mode

// Capture XHR/fetch response bodies during mutations
// (run before performing the action)
const origFetch = window.fetch;
window._mutations = [];
window.fetch = async (...args) => {
  const resp = await origFetch(...args);
  if (args[0].includes('/api/') && ['POST','PUT','DELETE','PATCH']
      .some(m => (args[1]?.method || 'GET').toUpperCase() === m)) {
    const clone = resp.clone();
    clone.json().then(body => window._mutations.push({url: args[0], status: resp.status, body}));
  }
  return resp;
};
// After action: window._mutations[window._mutations.length - 1]

Phase 3: Verification Stack

From most reliable to least:

Layer	Tool	What it catches	What it misses	Required for
API contract	`browser_network_requests`	Schema drift, stale cache, missing fields	Display-time rendering errors	All issues
DB mutation	Fetch diff before/after	Silent write failures, double-submit	DB-level corruption, race conditions	CRITICAL + DB issues
Compute invariants	`evaluate_script` / JS	Math errors, weight mismatches	Context coupling bugs	CRITICAL + data issues
Interaction test	Playwright click + evaluate	Golden path, coupling path	Visual/CSS bugs	All issues
Viewport test	`browser_resize` + overflow check	Responsive breaks, touch targets	OS/font-rendering specific bugs	MEDIUM+ issues
Design quality	evaluate_script + screenshot	Font size, overlap, tooltip overflow, chart standards	Animation, semantics, content accuracy	ALL design issues
Screenshot	`browser_take_screenshot`	Layout collapse, obvious mismatches	Subtle color/font rendering, sign direction in radial charts	Every closed issue
Console logs	`browser_console_messages`	JS errors, unhandled rejections	Silent data errors	All routes

Coverage rules by severity:

CRITICAL: use ALL layers
HIGH: skip nothing; use all layers except viewport at 768px+
MEDIUM: skip Compute invariants if not data-related; always include Design quality
LOW: screenshot + design check sufficient

Weaknesses of This Approach

Log each weakness encountered in weakness_log. If the log is empty after a run, the stack wasn’t applied honestly.

You measure what you can specify. Silent statistical errors require knowing the correct formula. If the formula is wrong, the invariant check passes.
Golden dataset drift. New ingest shifts all Z-scores; a test that passed last run may fail for the right reason (not a bug).
Correlation vs causation. A z-score outlier may reflect a genuinely unusual session. Don’t fix outliers without checking raw session data.

Field presence ≠ field correctness. A field with the wrong unit, aggregation period, or rounding passes the API contract test. Combine with DS Lens.
Cache invalidation timing. Freshly restarted dev server may serve stale SQLite if ingest didn’t run. Always cargo run -p rusty-bloomnet-ingest before QA.
Double-submit not always caught by API diff. If the endpoint is idempotent by accident (UPSERT overwrites), the second click silently succeeds without creating a duplicate: passing the test while hiding a missing idempotency guard.

Context coupling requires multi-component reasoning. A single-element test misses the case where the radar renders correctly for the wrong context. Tests must include state-change sequences.
Circular confidence. You write the fix and the test. After fixing a CRITICAL issue, have a second agent (or the advisor) independently verify without seeing your hypothesis.
The A || B class. Any fallback pattern in display-binding code is a potential coupling bug. Grep for || in display-binding code after every schema change.
Agentic test pollution. Each evaluate_script call runs in the live page context. A test that mutates page state corrupts subsequent tests. Always reset to baseline state between tests.

Screenshot semantics. A radar spoke at 30% height looks identical whether the value is +0.30 or -0.30. Always pair visual checks with DOM value extraction.
12-dimension rubric designed for static R viz. Animated transitions, hover state timing, and scroll behavior are not scored. Extend the rubric if these matter.
Font rendering varies by OS/browser. A 11px font looks different on macOS Retina vs Windows ClearType. Test on the target OS; don’t assume dev machine rendering = production.
Colorblind check is binary. “No red+green encoding” catches the most common failure but doesn’t validate full WCAG contrast ratios. Use a contrast ratio checker for thorough accessibility.

4 breakpoints miss real-world fragmentation. Viewport widths of 320px (older iPhones), 414px (iPhone Plus), and 360px (Android) often reveal different overflow patterns than 375px.
Touch target size check is geometry-only. A 44×44px target that overlaps another 44×44px target still fails usability. Check for overlap between adjacent interactive elements.
Viewport tests don’t capture OS chrome. Browser toolbar + status bar reduce the actual usable viewport height. Test at full-page height, not just width.

Small corpus problem. With < 5 runs in history, pattern crystallization risks overfitting to a single unusual session. Minimum 2 occurrences before crystallizing any heuristic.
Skill update doesn’t automatically invalidate old tests. When a new check replaces an old one, the old check must be explicitly retired: otherwise the skill grows unbounded.
Run record is local. ~/.rusty-data/dashboard-qa-history.jsonl does not sync across machines. If QA runs happen on multiple machines, the corpus is fragmented.

Quality Checks

Every CRITICAL and HIGH issue has a closed coupling-path test. State-change sequence, not just golden-path.
API schema diff after any backend change. curl -s http://localhost:3001/api/mqi | jq 'keys' before and after; verify no fields disappeared.
Weighted-sum invariant passes. sum(w_i * z_i for all metrics) == compositeZ to 5 decimal places. Run after every scoring fix.
Console is clean. After full route navigation: 0 JS errors, 0 unhandled rejections.
All DB mutations verified. Every button/form that writes to the DB has a before/after fetch diff recorded.
Viewport check passed at 375px and 768px. No horizontal overflow, all key nav elements visible.
All tooltips/dropdowns within viewport. No viewport overflow at right/bottom edge of screen.
Screenshot taken and read for every closed issue. Not just captured: read. Wrong screenshot = issue not closed.
Design Quality Lens applied to all charts. Every chart checked against D1–D4 criteria.
Weakness log populated. Every run must produce ≥ 1 entry. Empty log = stack not applied honestly.
Run record appended. ~/.rusty-data/dashboard-qa-history.jsonl has a new entry after every completed run.

Phase 4: Self-Improving Loop

Pattern: Skill Crystallization (Pattern 2) + Metric Ratchet (Pattern 3) from skills/self-improving-agent-patterns.
Invariant: Improvements to the skill are locked: checks are never removed unless the full corpus shows zero signal (≥ 5 runs, 0 occurrences). New checks are added only when a pattern appears in ≥ 2 runs.

Strength: Skill compounds across projects and time; issue classes discovered in run N become systematic checks in run N+1 rather than rediscoveries.
Weakness: The corpus is local; the crystallization threshold (≥ 2 runs) is a heuristic, not a statistically grounded cutoff.

Step 1: Record This Run

After closing all issues, append a run record:

{
  "date": "2026-04-22T14:30:00Z",
  "app_url": "http://localhost:3001",
  "run_version": "260422b",
  "issues_found": 9,
  "issues_closed": 7,
  "issues_open": 2,
  "issue_breakdown": {
    "data": 3,
    "api": 1,
    "scoring": 2,
    "design": 2,
    "db": 1,
    "responsive": 0
  },
  "new_issue_types_seen": ["label_overlap", "dual_axis_missing_legend"],
  "new_weaknesses_discovered": ["tooltip_content_not_verified"],
  "checks_that_found_nothing": ["double_submit_guard", "touch_target_size"],
  "notes": "latestSessionMqi coupling bug found via AI-eng lens; weighted-sum verified to 5dp"
}

# Append run record
echo '{...}' >> ~/.rusty-data/dashboard-qa-history.jsonl

Step 2: Read the Corpus

Before proposing any skill update, read the entire history:

cat ~/.rusty-data/dashboard-qa-history.jsonl | jq -s '.'
# Count occurrences of each issue type across all runs
cat ~/.rusty-data/dashboard-qa-history.jsonl | jq -s '[.[].issue_breakdown | to_entries[]] | group_by(.key) | map({type: .[0].key, total: map(.value) | add, runs: length})'

Analyze:

Which issue types appear in ≥ 2 runs? → Candidates for systematic promotion to Phase 1 checklist
Which checks found nothing in ≥ 5 runs? → Candidates for retirement or deprioritization
Which new weaknesses appeared in ≥ 2 runs? → Candidates for addition to Weaknesses section
Which fix patterns recurred? → Candidates for codification in the Fix Discipline section

Step 3: Propose Skill Updates (GEPA Protocol)

G: Generate: Run the QA loop (Phases 0–3).
E: Evaluate: Compare this run against corpus. Answer: what did this run discover that wasn’t in the skill? What did the skill predict correctly?
P: Propose: Write a concrete update proposal. Each proposal has:

type: add_check | retire_check | add_weakness | update_fix_discipline | update_phase
evidence: which runs support this (list of dates + brief description)
change: the exact text to add or remove
threshold_met: true if ≥ 2 run evidence

A: Accept: Do NOT apply the update automatically. Present proposals to the user. Accept only proposals where threshold_met: true. After acceptance, apply as a Changelog entry and update the skill file.

Example proposal (threshold met):

  type: add_check
  evidence: runs 2026-04-22 and 2026-04-29 both found label_overlap in chart annotations
  change: Add to D3 Typography checks: "Chart annotation labels must not overlap.
          Verify with getBoundingClientRect() on all .recharts-label elements."
  threshold_met: true
  
Example proposal (threshold NOT met):

  type: add_check
  evidence: only 2026-04-22 found dual_axis_missing_legend
  change: Add dual-axis legend requirement to D2
  threshold_met: false  ← present to user but do not auto-apply

Step 4: Monotonic Invariant

The skill may only improve, never degrade:

Never remove a check unless: (a) ≥ 5 runs with zero signal AND (b) user explicitly authorizes removal.
Never weaken a threshold (e.g., changing ”≥ 44px” to ”≥ 32px”) unless the original threshold is demonstrably wrong.
Lock all accepted improvements. Once a check is added, it becomes part of all future runs.

The monotonic invariant prevents the common failure mode where a skill gets progressively simplified until it’s useless. Every added check should make future QA runs more thorough, not just longer.

Provenance

Initial creation 2026-04-22: seven MQI v3 scoring bugs found and fixed via the DS/DE/AI-eng three-lens approach in rusty-bloomnet. The latestSessionMqi || currentMqi coupling bug (model-quality.js

) was discovered via the AI engineer lens: invisible to golden-path tests.

Upgrade 2026-04-22b: Design Quality Lens draws from topics/wiki-image-eval-rubric 12-dimension rubric (developed 2026-04-03 for vault wiki visualization scoring). DB/cache mutation verification protocol drawn from a pattern observed across multiple jobs-apply and SKL audit sessions where buttons returned 200 but silently failed to write. Self-improving loop uses Pattern 2 (Skill Crystallization) + Pattern 3 (Metric Ratchet) from skills/self-improving-agent-patterns, adapted for QA skill evolution rather than metric optimization.

Ralph loop adaptation: the original definitions/ralph-loop iterates over spec files to build new components. Here, it iterates over issues to fix existing ones. The invariants carry over: single-issue scoping (one issue per loop, no scope creep), error tolerance (a failed fix does not abort the catalog), throwaway hypothesis (the first hypothesis is usually wrong; commit only after the interaction test passes).

Cross-Pattern Context

skills/karpathy-ratchet: use when a scoring metric plateaus; the ratchet optimizes a tunable config, the QA loop fixes correctness bugs. Different targets, same measurement discipline.
skills/self-improving-agent-patterns: Pattern 2 (Skill Crystallization) and Pattern 3 (Metric Ratchet) underpin Phase 4 of this skill.
definitions/ralph-loop: the original autonomous build loop adapted here for issue-driven QA iteration.
skills/debug-capture-system: for capturing console/screenshot evidence during the visual check phase.
topics/wiki-image-eval-rubric: 12-dimension rubric adapted for the Design Quality Lens (Phase 1d).
topics/visual-output-routing: routing framework when the issue involves chart type mismatch (R viz vs Figma).
topics/brand-token-chart-families: brand token system; charts should draw from brand$ tokens, not hardcoded colors.

Visual Enrichment

Finding Type	Tool	Specification
Issue catalog by severity + category	R viz (skills/r-visualization-pipeline)	Family: `CMP` grouped bar, Template: Journal
Issue close rate across runs (ratchet)	R viz (skills/r-visualization-pipeline)	Family: `TS` multiline, Template: Journal
Verification stack coverage map	Figma MCP (`generate_diagram`)	Type: Matrix/flowchart: issues × layers
Corpus issue frequency heatmap	R viz (skills/r-visualization-pipeline)	Family: `COR` heatmap, Template: Journal