dashboard-qa-loop
User wants to QA a running web dashboard: find issues, interaction-test each element, and run fix-verify loops until the dashboard is clean
Changelog
260423: Hover content standards
- D4: added content-standard rules: no bare numbers, coverage requirement, no em dashes
- D4: added tooltipContentOk() helper for em-dash and bare-number detection
- Version bump to 260423
260422b: Major upgrade (self-improving + design quality + DB mutation + viewport)
- Added Phase 0b: Viewport & Responsive Survey (4 breakpoints, overflow checks)
- Added Phase 1d: Design Quality Lens (visual quality, typography, hover, dual-axis, colorblind)
- Added Phase 2b: DB/Cache Mutation Verification (button/form → DB diff protocol)
- Added Phase 4: Self-Improving Loop (GEPA skill crystallization, corpus-aware, ~/.rusty-data/dashboard-qa-history.jsonl)
- Added run_record output, design_issues output
- Added browser_resize + browser_hover to mounted_allowed_tools
- Design quality lens draws from topics/wiki-image-eval-rubric 12-dimension rubric
- Self-improving pattern uses Pattern 2 (Skill Crystallization) + Pattern 3 (Metric Ratchet) from skills/self-improving-agent-patterns
- Expanded weaknesses with design + self-improvement blind spots
260422a: Initial creation
- Codifies the dashboard QA workflow developed during MQI v3 scoring audit (rusty-bloomnet)
- Ralph loop adapted from definitions/ralph-loop for issue-driven QA iteration
- Incorporates DS/DE/AI-eng weakness taxonomy from session 2026-04-22
Description
A systematic workflow for auditing a running web dashboard across four dimensions: data correctness, API contract, scoring logic, and design quality. Discovers all interactive elements, tests responsiveness at multiple viewport sizes, catalogs issues with severity, then runs a ralph loop around each: reproduce → hypothesize → fix → interaction-test → visual-check → close or iterate.
At the end of every run, the self-improving loop appends a run record to a persistent history file and proposes skill updates based on the entire corpus of past runs, not just the current session.
Core insight: Most dashboard bugs are context coupling failures (rendered value correct in isolation, wrong given surrounding state) or silent DB failures (button click returns 200 but the DB state did not change). Both classes are invisible to single-element tests; only cross-component interaction + DB diff tests expose them.
Interface
Trigger: “QA the dashboard”, “run interaction tests”, “find issues in the UI”, “ralph loops”, or after any schema/API/scoring/frontend change.
Inputs:
- app_url: running app URL (e.g. http://localhost:3001)
- api_prefix: API base path (e.g. /api)
- element_map: optional list of panels/routes to audit; if absent, do full traversal
Outputs:
- issue_catalog: structured list of all found issues with severity + hypothesis
- closed_issues: issues fixed and verified
- open_issues: remaining after max iterations
- weakness_log: categories of bugs the test stack cannot reliably catch
- design_issues: subset of catalog (typography, spacing, hover, viewport, chart quality violations)
- run_record: JSON entry appended to ~/.rusty-data/dashboard-qa-history.jsonl
Phase 0: Element Discovery
Build the full element map before cataloging issues. Navigate every route and enumerate:
| Layer | What to capture |
|---|---|
| Navigation | All tabs, routes, breadcrumb states |
| Controls | Pickers, date ranges, toggles, dropdowns |
| Charts | Chart type, axes, legend items, tooltip behavior |
| Panels | Panel headers, metric labels, badge values |
| Forms | Input fields, submit buttons, validation states |
| API endpoints | Every /api/* call triggered by each view |
| Console | JS errors and warnings emitted during navigation |
Scope rule: If element_map is provided, skip to Phase 0b. If absent, enumerate exhaustively: every interactive element is a potential issue surface.
// Enumerate all interactive elements on a page
[...document.querySelectorAll('button, input, select, [role="tab"], [data-picker]')]
.map(el => ({ tag: el.tagName, text: el.textContent.trim().slice(0, 40), id: el.id }))
Phase 0b: Viewport & Responsive Survey
Strength: Catches responsive bugs invisible at a single viewport. Most dashboards are built on a 1440px monitor and never tested at 375px.
Weakness: Multiple viewport tests multiply runtime; some bugs only appear at specific OS/browser combinations (font rendering, scrollbar widths).
Test at four breakpoints using browser_resize. At each breakpoint, take a screenshot and run the overflow check.
| Breakpoint | Width | Represents |
|---|---|---|
| Mobile | 375px | iPhone SE / smallest common |
| Tablet | 768px | iPad portrait |
| Desktop | 1280px | Laptop (common dev size) |
| Wide | 1440px | Standard monitor baseline |
Checks at each breakpoint:
// 1. Page-level horizontal overflow (no sideways scroll)
document.body.scrollWidth > window.innerWidth
// TRUE = overflow exists → file MEDIUM issue
// 2. Chart overflow: all charts contained within their parent
Array.from(document.querySelectorAll('canvas, svg, .chart')).map(el => {
const rect = el.getBoundingClientRect();
const parent = el.parentElement.getBoundingClientRect();
// getAttribute('class') is used because className on SVG elements is an SVGAnimatedString, not a string
return { el: el.getAttribute('class'), overflow: rect.right > parent.right + 4 };
})
// 3. Navigation accessible: key nav elements visible and not clipped
Array.from(document.querySelectorAll('nav a, [role="tab"]')).map(el => {
const rect = el.getBoundingClientRect();
return { text: el.textContent.trim(), visible: rect.width > 0 && rect.height > 0 };
})
// 4. Touch target size: interactive elements ≥ 44px × 44px (WCAG 2.5.5)
Array.from(document.querySelectorAll('button, a, [role="tab"], input')).map(el => {
const rect = el.getBoundingClientRect();
return { text: el.textContent.trim().slice(0,30), w: rect.width, h: rect.height,
fail: rect.width < 44 || rect.height < 44 };
}).filter(x => x.fail)
Issue severity by viewport:
- Overflow at 375px → HIGH (affects real mobile users)
- Overflow at 768px → MEDIUM
- Overflow at 1280px+ → LOW (dev machine size; unlikely in production)
- Touch target < 44px → MEDIUM at any size
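The severity rules above can be expressed as a small lookup. This is a minimal sketch: the name viewportSeverity is hypothetical, and the handling of widths that fall between the four tested breakpoints (each width inherits the rule of the nearest breakpoint at or below it) is an assumption, not something the rules state.

```javascript
// Hypothetical helper: maps a failed viewport check to a catalog severity.
// kind: 'overflow' | 'touch-target'; width: viewport width in CSS px.
function viewportSeverity(kind, width) {
  if (kind === 'touch-target') return 'MEDIUM'; // MEDIUM at any size
  if (kind !== 'overflow') throw new Error(`unknown check kind: ${kind}`);
  if (width < 768) return 'HIGH';    // mobile range (375px breakpoint)
  if (width < 1280) return 'MEDIUM'; // tablet range (768px breakpoint)
  return 'LOW';                      // 1280px and wider
}
```

Call it once per failed check while looping over the breakpoints, and copy the result into the severity field of the catalog entry.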
Phase 1: Issue Catalog
For each element, run a structured inspection. Output a catalog entry for every anomaly:
Issue #{id}
severity: [CRITICAL | HIGH | MEDIUM | LOW]
category: [data | api | scoring | design | db | responsive]
element: <panel or component name>
symptom: <what looks wrong>
hypothesis: <root cause: data, compute, display, coupling, or design?>
api_check: <which /api/* endpoint to verify first>
viewport: <at which breakpoint seen; "all" if always present>
status: OPEN
Severity classification:
| Severity | Meaning |
|---|---|
| CRITICAL | Wrong data shown; user makes incorrect decisions |
| HIGH | Correct data, wrong context / responsive break at common size |
| MEDIUM | Visual/UX inconsistency; design violation; small viewport break |
| LOW | Cosmetic, aspirational, or rare-condition improvement |
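One way to keep catalog entries consistent is to build them through a small factory that rejects unknown severity or category values. The sketch below is illustrative only: makeIssue and bySeverity are hypothetical names, and the default values for optional fields are assumptions.

```javascript
const SEVERITIES = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW'];
const CATEGORIES = ['data', 'api', 'scoring', 'design', 'db', 'responsive'];

// Hypothetical factory: builds one catalog entry and rejects unknown enum values.
function makeIssue(id, fields) {
  const { severity, category, element, symptom,
          hypothesis = '', apiCheck = '', viewport = 'all' } = fields;
  if (!SEVERITIES.includes(severity)) throw new Error(`bad severity: ${severity}`);
  if (!CATEGORIES.includes(category)) throw new Error(`bad category: ${category}`);
  return { id, severity, category, element, symptom, hypothesis,
           api_check: apiCheck, viewport, status: 'OPEN' };
}

// Orders a catalog CRITICAL first; entries of equal severity keep discovery order
// (Array.prototype.sort is stable in current engines).
function bySeverity(issues) {
  return [...issues].sort(
    (a, b) => SEVERITIES.indexOf(a.severity) - SEVERITIES.indexOf(b.severity));
}
```

Sorting a copy rather than the catalog itself keeps the original discovery order available for the run record.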
Four-Lens Inspection
Run each element through all four lenses. Each finds bugs the others miss.
DS Lens: Data Accuracy
Goal: rendered numbers match independently computed values from raw API data.
- Fetch raw API response for the relevant endpoint.
- Manually compute the expected rendered value (e.g., run groupZ by hand).
- Extract rendered DOM value via evaluate_script / browser_evaluate.
- Diff: rendered_value == expected_computed_value.
Critical invariants:
- Weighted-sum: sum(w_i * z_i for all metrics) == compositeZ (5 decimal precision)
- groupZ: each group’s radar spoke = sum(z_i * w_i for members) / sum(w_i for members)
- Threshold crossings: metric at z=-2.1 renders in Error band, not Warning
- Baseline vs current: “base” column values are drawn from the same session/period as “current”
Strength: Catches silent math errors that golden-path visual checks completely miss.
Weakness: You must know the correct formula in advance. If the formula itself is wrong, the invariant check passes. You measure what you can specify.
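The weighted-sum and groupZ invariants can be recomputed from raw API data with a few pure functions. This is a sketch under assumptions: the metric shape ({ group, w, z }) and the function names are hypothetical, and "5 decimal precision" is interpreted as agreement within half a unit of the fifth decimal place.

```javascript
// metrics: [{ group, w, z }] taken from the raw API response (shape is an assumption).
function compositeZ(metrics) {
  return metrics.reduce((sum, m) => sum + m.w * m.z, 0);
}

// groupZ for one group: weighted mean of the member z-scores.
function groupZ(metrics, group) {
  const members = metrics.filter(m => m.group === group);
  const wSum = members.reduce((s, m) => s + m.w, 0);
  if (wSum === 0) return NaN; // empty group or all-zero weights: nothing to average
  return members.reduce((s, m) => s + m.z * m.w, 0) / wSum;
}

// Equality to 5 decimal places, tolerant of floating-point noise.
function equalTo5dp(a, b) {
  return Math.abs(a - b) < 0.5e-5;
}
```

Compare the results against the DOM-extracted values with equalTo5dp rather than ===, since the rendered value is usually rounded for display.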
DE Lens: API Contract
Goal: API response shape and freshness match what the frontend expects.
- Capture all network requests during navigation (browser_network_requests).
- For each /api/* call: check HTTP status, Content-Type, response schema fields.
- Verify no stale cache: compare response timestamp against last ingest run timestamp.
- Check added/removed fields don’t silently produce undefined in the frontend.
# Schema diff: before and after any backend change
curl -s http://localhost:3001/api/mqi | jq 'keys'
# Compare output across two runs; missing keys = CRITICAL
Critical rule: HTTP 200 != dashboard green. A 200 with a wrong schema silently produces NaN or 0 in the UI. Always click every panel after API/schema changes.
Strength: Catches schema drift before it reaches users.
Weakness: API field presence ≠ API field correctness. A field present with a wrong value (wrong unit, wrong aggregation period) passes the contract test. Combine with DS Lens to verify values, not just shape.
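The before/after key comparison can also be done in script form when two parsed responses are already in hand. A minimal sketch, assuming only top-level keys matter (as with jq 'keys'); diffKeys is a hypothetical name.

```javascript
// Compares the top-level keys of two API responses (before/after a backend change).
function diffKeys(before, after) {
  const b = new Set(Object.keys(before));
  const a = new Set(Object.keys(after));
  return {
    missing: [...b].filter(k => !a.has(k)).sort(), // disappeared: CRITICAL
    added: [...a].filter(k => !b.has(k)).sort(),   // new: confirm the frontend handles them
  };
}
```

Like the contract test itself, this only checks shape; a key that is present with a wrong value still passes.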
AI Engineer Lens: Scoring Invariants
Goal: scoring system internal consistency.
- Verify metric weights sum to 1.0 across all groups.
- Verify each metric is a member of exactly one group (no double-counting, no orphans).
- Verify baseline source is stable: same session selection → same baseline z-scores.
- Check picker ↔ display coupling: changing the session picker actually changes ALL downstream rendered values (radar, table, composite score), not just some.
The A || B coupling bug class: Any latestSessionMqi || currentMqi pattern is a potential coupling failure: A may be correct for one UI state but wrong for another. After every API or schema change, grep:
grep -n '||' frontend/js/model-quality.js | grep -i 'session\|mqi\|current\|latest'
Test the coupling path explicitly:
- Select “All (30-day mean)” in session picker → verify radar uses 30-day groupZ, not latest session
- Select specific session → verify radar switches to that session’s groupZ
- Switch back to “All” → verify radar reverts (no stale state)
Strength: Catches context coupling failures invisible to single-element golden-path tests.
Weakness: Requires knowing the intended coupling semantics. If the design intent is ambiguous, the test cannot distinguish “wrong coupling” from “intentional decoupling.”
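The weight-sum and membership checks at the top of this lens lend themselves to a single pure function. The sketch below assumes a hypothetical config shape (groups maps group name to metric names, weights maps metric name to weight) and reads "weights sum to 1.0 across all groups" as the total over every metric; adjust if the real scoring config differs.

```javascript
// groups: { groupName: [metricName, ...] }, weights: { metricName: weight }
// (both shapes are assumptions for illustration).
function checkScoringInvariants(groups, weights, tol = 1e-9) {
  const counts = {};
  for (const members of Object.values(groups))
    for (const name of members) counts[name] = (counts[name] || 0) + 1;

  const doubleCounted = Object.keys(counts).filter(n => counts[n] > 1).sort();
  const orphans = Object.keys(weights).filter(n => !counts[n]).sort();
  const weightSum = Object.values(weights).reduce((s, w) => s + w, 0);
  const weightsOk = Math.abs(weightSum - 1) <= tol;
  return { weightSum, weightsOk, doubleCounted, orphans,
           ok: weightsOk && doubleCounted.length === 0 && orphans.length === 0 };
}
```

This covers only the static invariants; the picker coupling path still has to be exercised through state-change sequences in the browser.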
Design Quality Lens
Goal: visual design meets professional quality standards across five sub-dimensions. Draws from the 12-dimension image eval rubric at topics/wiki-image-eval-rubric. Apply this lens to every chart, tooltip, table, and interactive control.
Strength: Objectifies visual quality with a weighted scoring rubric; prevents “looks fine to me” from blocking legitimate design bugs.
Weakness: The rubric was designed for static R viz exports, not interactive web components. Animated transitions, hover states, and scroll behavior are not captured. Screenshot evaluation is pixel-level, not semantic: a radar spoke at 30% height looks identical whether the value is +0.30 or -0.30.
D1: Chart Quality (draws from wiki-image-eval-rubric dimensions 1–8)
For every chart, evaluate:
| Check | What to verify | Severity if failing |
|---|---|---|
| Accuracy & Scale | Y-axis baseline is 0 or clearly annotated; no truncated axes that exaggerate differences | CRITICAL |
| Clarity of Purpose | Chart title states the insight, not the chart type (“MQI dropped 15%” not “Radar Chart”) | MEDIUM |
| Chart Type Fit | Type matches data: radar for multi-dimensional comparison, line for temporal trends, bar for category comparison. Pie only for ≤5 slices summing to 100% | HIGH |
| Labels & Annotations | Axes have labels with units; legend present if >1 series; data points annotated where they tell a story | MEDIUM |
| Colorblind Safety | No red+green-only encoding; palette passes grayscale test | HIGH |
| Dual-axis discipline | See D2 below | HIGH |
| Overplotting | Dense scatterplots use alpha or hexbin; overlapping text avoided | MEDIUM |
| Visual Hierarchy | Color weight and font size direct attention to primary finding first | LOW |
// Extract chart title for clarity check
document.querySelector('.chart-title, .recharts-label, svg title')?.textContent
// Check Y-axis baseline (SVG-based charts)
document.querySelector('.recharts-yAxis .recharts-cartesian-axis-tick-value')?.textContent
// If first tick is not "0" and chart is a bar/line, flag for truncation review
D2: Dual-Axis Charts
Dual-axis charts are among the most frequently misused chart types. Apply these rules whenever a chart has two Y-axes:
| Rule | Rationale | Check |
|---|---|---|
| Visually distinct line styles | Left and right axes must use different line types (solid vs dashed) | Check SVG stroke-dasharray attributes |
| Different colors per axis | Each axis series should have its own color from the brand palette | Both series same color = CRITICAL |
| Legend labels both axes | Legend must identify which series uses which axis | Check legend text vs axis label text |
| Scale relationship disclosed | If scales are mismatched, a note should explain (e.g., “left: count, right: rate”) | Missing disclosure = HIGH |
| No scale cherry-picking | Right axis should not be scaled to visually minimize/exaggerate the secondary series | Requires domain review |
// Check for dual-axis presence
document.querySelectorAll('.recharts-yAxis').length > 1
// If true, apply full D2 checklist
D3: Typography & Spacing
| Check | Standard | How to verify | Severity |
|---|---|---|---|
| Minimum font size | ≥ 11px rendered (not just CSS) | getComputedStyle(el).fontSize | HIGH |
| Line height | ≥ 1.4 for body text, ≥ 1.2 for compact metric values | Computed lineHeight (px) ÷ computed fontSize (px); getComputedStyle(el).lineHeight may also return “normal” | MEDIUM |
| Text truncation | Key metric values must not be truncated with ellipsis | Check el.scrollWidth > el.clientWidth | HIGH |
| Element padding | Clickable elements: min 8px padding on all sides | getComputedStyle(el).padding | MEDIUM |
| Label overlap | Text nodes within 2px of each other = collision | Compare getBoundingClientRect() of adjacent labels | HIGH |
| Z-index collision | Two positioned layers occupy same screen space | Check computed z-index for overlapping components | MEDIUM |
// Font size audit on all metric value elements
Array.from(document.querySelectorAll('.metric-value, .badge, .score, td')).map(el => ({
text: el.textContent.trim().slice(0, 20),
fontSize: parseFloat(getComputedStyle(el).fontSize),
truncated: el.scrollWidth > el.clientWidth
})).filter(x => x.fontSize < 11 || x.truncated)
// Label overlap detection for chart annotations
function hasOverlap(rects) {
for (let i = 0; i < rects.length; i++)
for (let j = i+1; j < rects.length; j++)
// 2px tolerance on all four sides, so the result does not depend on label order
if (rects[i].right > rects[j].left - 2 && rects[i].left < rects[j].right + 2
&& rects[i].bottom > rects[j].top - 2 && rects[i].top < rects[j].bottom + 2)
return true;
return false;
}
const labelRects = Array.from(document.querySelectorAll('.recharts-label, .chart-label'))
.map(el => el.getBoundingClientRect());
hasOverlap(labelRects) // true = file HIGH issue
D4: Hover States & Tooltip Viewport Safety
For every tooltip, dropdown, and popover: trigger hover/click, then check viewport containment.
Strength: getBoundingClientRect() gives exact pixel positions; viewport overflow is objective.
Weakness: Only catches geometric overflow, not content accuracy inside the tooltip. A tooltip that appears inside the viewport but shows wrong data passes this check.
// After triggering a tooltip, check viewport containment
function tooltipSafe(tooltipEl) {
const rect = tooltipEl.getBoundingClientRect();
return {
topOk: rect.top >= 0,
leftOk: rect.left >= 0,
bottomOk: rect.bottom <= window.innerHeight,
rightOk: rect.right <= window.innerWidth,
safe: rect.top >= 0 && rect.left >= 0 &&
rect.bottom <= window.innerHeight &&
rect.right <= window.innerWidth
};
}
// Use browser_hover to trigger, then browser_evaluate to check
Required hover test sequence per tooltip:
- Hover element → tooltip appears
- Run tooltipSafe() → all four bounds must pass
- Move mouse away → tooltip disappears (no ghost tooltip)
- Hover near right/bottom edge of viewport → tooltip should flip/reposition, not clip
Content standards (checked after geometric pass):
- No bare numbers. Every numeric value in a tooltip must carry a label. 2,383 alone is a HIGH issue; Sessions: 2,383 is correct. Check every chart data-point tooltip, metric card hover, and badge popover.
- Coverage requirement. Any dashboard element that benefits from a concise explanation must have a hover tooltip: charts, metric cards, badges, tier labels, model abbreviations, status icons. Flag missing tooltips as MEDIUM issues.
- No em dashes. Tooltip text must never contain an em dash (U+2014). Use a plain hyphen (-) or colon (:) as separators. Search tooltip text with tooltipEl.textContent.includes('\u2014') and flag any match as HIGH.
// Content-standard checks
function tooltipContentOk(tooltipEl) {
const text = tooltipEl.textContent || '';
const hasEmDash = text.includes('\u2014');
// bare number: the tooltip has a standalone numeric token and no letters anywhere
// (coarse whole-tooltip heuristic; one unlabelled value inside a labelled tooltip is not detected)
const bareNumber = /(?:^|\s)[\d,]+(?:\s|$)/.test(text) && !/[A-Za-z]/.test(text);
return { hasEmDash, bareNumber, ok: !hasEmDash && !bareNumber };
}
Phase 2: Ralph Loop Per Issue
Once the catalog is populated, run one ralph loop per issue (highest severity first):
ISSUE = pop next OPEN issue (CRITICAL first, then HIGH, MEDIUM, LOW)
loop:
reproduce → confirm symptom is still present in current build
hypothesize → classify root cause: data? compute? display? coupling? design? db?
fix → minimal upstream code change addressing root cause
interaction-test → golden path + coupling path (see below)
visual-check → screenshot + read; confirm rendered value matches intent
if verified:
close ISSUE → mark CLOSED, record fix in catalog
else:
iterate → refine hypothesis, try again (max 3 iterations before escalate)
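The loop above can be driven by a small function that takes the per-step work as callbacks. This is a sketch, not the skill's implementation: ralphLoop, the steps object (reproduce, fix, verify), the resolution field, and the ESCALATED status are all hypothetical names chosen for illustration.

```javascript
const SEVERITY_ORDER = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW'];

// steps: { reproduce, fix, verify } are hypothetical callbacks supplied by the caller.
// Each may be sync or async; fix and verify receive (issue, attempt).
async function ralphLoop(catalog, steps, maxIterations = 3) {
  const queue = catalog
    .filter(i => i.status === 'OPEN')
    .sort((a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity));

  for (const issue of queue) {
    if (!(await steps.reproduce(issue))) {       // symptom gone: do not write a fix
      issue.status = 'CLOSED';
      issue.resolution = 'already fixed';
      continue;
    }
    issue.status = 'ESCALATED';                  // stays set if no attempt verifies
    for (let attempt = 1; attempt <= maxIterations; attempt++) {
      await steps.fix(issue, attempt);           // hypothesize + minimal fix
      if (await steps.verify(issue, attempt)) {  // interaction test + visual check
        issue.status = 'CLOSED';
        issue.resolution = `fixed on attempt ${attempt}`;
        break;
      }
    }
  }
  return catalog;
}
```

Keeping one issue per outer iteration mirrors the single-issue scoping rule: an escalated issue does not stop the remaining catalog from being processed.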
Reproduce Step
Never code a fix without confirming the bug is still live:
// Extract current rendered value
document.querySelector('.mqi-composite-score, .metric-value')?.textContent
If symptom is gone, close as “already fixed.” Do not write a fix.
Fix Discipline
- Fix the root cause, not the symptom. A display-only patch lets the bug recur at the next data change.
- Fix upstream first: backend bug → fix Rust; compute bug → fix JS function; display bug → fix binding.
- Minimal change: exactly this issue. Do not refactor surrounding code.
- No gate weakening: do not loosen a threshold or remove a validation to make a bug disappear. That is not a fix: it is a measurement demotion.
Interaction Test
Test the golden path and the coupling path for every fix:
- Golden path: correct value renders for default state
- Coupling path: related control changes (picker, date range, tab) still propagate correctly downstream
- Design path (for design issues): verify the element meets the relevant D1–D4 criterion after fix
// Coupling path example (session picker fix):
// 1. Select "All" → radar must use 30-day groupZ
// 2. Select specific session → radar must switch to that session's groupZ
// 3. Switch back to "All" → radar reverts; no stale state
Visual Check
After passing the interaction test, take a screenshot and read it. Confirm:
- Labels match computed values
- Color bands correct (Error=red, Warning=yellow, Good=green)
- Chart shape makes intuitive sense (low-thinking session → short Thinking spoke)
- No layout collapse, overlapping text, or clipped elements
- Tooltip stays within viewport
- At 375px: no horizontal overflow
If the visual check fails despite the interaction test passing → bug is CSS/layout → file new MEDIUM design issue.
Phase 2b: DB/Cache Mutation Verification
Applies to: every button click, form submit, toggle, or picker that is supposed to write to the database.
Strength: Catches the most dangerous silent failure class: a UI action that returns 200 but doesn’t actually persist.
Weakness: API-only verification doesn’t catch DB-layer corruption, referential integrity failures, or race conditions from concurrent writes. For those, inspect the SQLite DB directly after the action.
Protocol per interactive mutation:
BEFORE:
1. Record current DB state: GET /api/<relevant-endpoint> → save as before_state
2. Note which fields should change after the action
ACTION:
3. Perform the UI action (button click, form submit, toggle)
4. Wait for network idle (all /api/* requests complete)
AFTER:
5. GET /api/<same-endpoint> → save as after_state
6. Diff: expected_changed_fields = after_state - before_state
7. Check: all expected fields changed
8. Check: no unexpected fields changed (no side effects)
CACHE CHECK:
9. Hard-reload the page (Ctrl+Shift+R equivalent)
10. GET /api/<endpoint> again → must still show after_state
(confirms cache was invalidated, not just bypassed)
DOUBLE-SUBMIT GUARD:
11. Click the button twice rapidly
12. GET /api/<endpoint> → must show exactly one change, not two
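Steps 6 to 8 of the protocol reduce to a field-level diff of the two snapshots. A minimal sketch, assuming both states are flat JSON objects; diffState and its return shape are hypothetical, and nested values are compared by their JSON text, which is sensitive to key order.

```javascript
// Flat before/after comparison of two API snapshots.
function diffState(before, after, expectedFields) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const changed = [...keys]
    .filter(k => JSON.stringify(before[k]) !== JSON.stringify(after[k]))
    .sort();
  const notChanged = expectedFields.filter(k => !changed.includes(k)); // step 7 failures
  const unexpected = changed.filter(k => !expectedFields.includes(k)); // step 8 failures
  return { changed, notChanged, unexpected,
           ok: notChanged.length === 0 && unexpected.length === 0 };
}
```

Fields such as timestamps that legitimately change on every write should be listed in expectedFields; otherwise they are reported as side effects.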
Common silent DB failures to look for:
| Symptom | Likely cause | Where to check |
|---|---|---|
| UI updates but page reload reverts | Browser state updated, DB write failed silently | Check API response body for error message |
| First click works, second doesn’t | Missing idempotency key / unique constraint | Check network response on second click |
| Form submit succeeds but field blank in DB | Frontend sends correct payload, backend ignores field | Check request payload vs DB schema |
| Cache serves stale data after write | Cache not invalidated on write path | Check Cache-Control / ETag headers; check SQLite WAL mode |
// Capture fetch response bodies during mutations
// (run before performing the action; XHR requests are not intercepted by this wrapper)
const origFetch = window.fetch;
window._mutations = [];
window.fetch = async (...args) => {
const resp = await origFetch(...args);
// args[0] may be a string, a URL, or a Request, so normalize before matching
const url = String(args[0]?.url ?? args[0]);
const method = (args[1]?.method || args[0]?.method || 'GET').toUpperCase();
if (url.includes('/api/') && ['POST','PUT','DELETE','PATCH'].includes(method)) {
resp.clone().json()
.then(body => window._mutations.push({url, status: resp.status, body}))
.catch(() => window._mutations.push({url, status: resp.status, body: null})); // non-JSON body
}
return resp;
};
// After action: window._mutations[window._mutations.length - 1]
Phase 3: Verification Stack
From most reliable to least:
| Layer | Tool | What it catches | What it misses | Required for |
|---|---|---|---|---|
| API contract | browser_network_requests | Schema drift, stale cache, missing fields | Display-time rendering errors | All issues |
| DB mutation | Fetch diff before/after | Silent write failures, double-submit | DB-level corruption, race conditions | CRITICAL + DB issues |
| Compute invariants | evaluate_script / JS | Math errors, weight mismatches | Context coupling bugs | CRITICAL + data issues |
| Interaction test | Playwright click + evaluate | Golden path, coupling path | Visual/CSS bugs | All issues |
| Viewport test | browser_resize + overflow check | Responsive breaks, touch targets | OS/font-rendering specific bugs | MEDIUM+ issues |
| Design quality | evaluate_script + screenshot | Font size, overlap, tooltip overflow, chart standards | Animation, semantics, content accuracy | ALL design issues |
| Screenshot | browser_take_screenshot | Layout collapse, obvious mismatches | Subtle color/font rendering, sign direction in radial charts | Every closed issue |
| Console logs | browser_console_messages | JS errors, unhandled rejections | Silent data errors | All routes |
Coverage rules by severity:
- CRITICAL: use ALL layers
- HIGH: use all layers; the only permitted omission is the viewport test at 768px and wider
- MEDIUM: skip Compute invariants if not data-related; always include Design quality
- LOW: screenshot + design check sufficient
Weaknesses of This Approach
Log each weakness encountered in weakness_log. If the log is empty after a run, the stack wasn’t applied honestly.
DS Blind Spots
- You measure what you can specify. Silent statistical errors require knowing the correct formula. If the formula is wrong, the invariant check passes.
- Golden dataset drift. New ingest shifts all Z-scores; a test that passed last run may fail for the right reason (not a bug).
- Correlation vs causation. A z-score outlier may reflect a genuinely unusual session. Don’t fix outliers without checking raw session data.
DE Blind Spots
- Field presence ≠ field correctness. A field with the wrong unit, aggregation period, or rounding passes the API contract test. Combine with DS Lens.
- Cache invalidation timing. Freshly restarted dev server may serve stale SQLite if ingest didn’t run. Always run cargo run -p rusty-bloomnet-ingest before QA.
- Double-submit not always caught by API diff. If the endpoint is idempotent by accident (UPSERT overwrites), the second click silently succeeds without creating a duplicate, passing the test while hiding a missing idempotency guard.
AI Engineer Blind Spots
- Context coupling requires multi-component reasoning. A single-element test misses the case where the radar renders correctly for the wrong context. Tests must include state-change sequences.
- Circular confidence. You write the fix and the test. After fixing a CRITICAL issue, have a second agent (or the advisor) independently verify without seeing your hypothesis.
- The A || B class. Any fallback pattern in display-binding code is a potential coupling bug. Grep for || in display-binding code after every schema change.
- Agentic test pollution. Each evaluate_script call runs in the live page context. A test that mutates page state corrupts subsequent tests. Always reset to baseline state between tests.
Design Quality Blind Spots
- Screenshot semantics. A radar spoke at 30% height looks identical whether the value is +0.30 or -0.30. Always pair visual checks with DOM value extraction.
- 12-dimension rubric designed for static R viz. Animated transitions, hover state timing, and scroll behavior are not scored. Extend the rubric if these matter.
- Font rendering varies by OS/browser. An 11px font looks different on macOS Retina vs Windows ClearType. Test on the target OS; don’t assume dev machine rendering = production.
- Colorblind check is binary. “No red+green encoding” catches the most common failure but doesn’t validate full WCAG contrast ratios. Use a contrast ratio checker for thorough accessibility.
Responsive Design Blind Spots
- 4 breakpoints miss real-world fragmentation. Viewport widths of 320px (older iPhones), 414px (iPhone Plus), and 360px (Android) often reveal different overflow patterns than 375px.
- Touch target size check is geometry-only. A 44×44px target that overlaps another 44×44px target still fails usability. Check for overlap between adjacent interactive elements.
- Viewport tests don’t capture OS chrome. Browser toolbar + status bar reduce the actual usable viewport height. Test at full-page height, not just width.
Self-Improvement Blind Spots
- Small corpus problem. With < 5 runs in history, pattern crystallization risks overfitting to a single unusual session. Minimum 2 occurrences before crystallizing any heuristic.
- Skill update doesn’t automatically invalidate old tests. When a new check replaces an old one, the old check must be explicitly retired; otherwise the skill grows unbounded.
- Run record is local. ~/.rusty-data/dashboard-qa-history.jsonl does not sync across machines. If QA runs happen on multiple machines, the corpus is fragmented.
Quality Checks
- Every CRITICAL and HIGH issue has a closed coupling-path test. State-change sequence, not just golden-path.
- API schema diff after any backend change. curl -s http://localhost:3001/api/mqi | jq 'keys' before and after; verify no fields disappeared.
- Weighted-sum invariant passes. sum(w_i * z_i for all metrics) == compositeZ to 5 decimal places. Run after every scoring fix.
- Console is clean. After full route navigation: 0 JS errors, 0 unhandled rejections.
- All DB mutations verified. Every button/form that writes to the DB has a before/after fetch diff recorded.
- Viewport check passed at 375px and 768px. No horizontal overflow, all key nav elements visible.
- All tooltips/dropdowns within viewport. No viewport overflow at right/bottom edge of screen.
- Screenshot taken and read for every closed issue. Not just captured: read. Wrong screenshot = issue not closed.
- Design Quality Lens applied to all charts. Every chart checked against D1–D4 criteria.
- Weakness log populated. Every run must produce ≥ 1 entry. Empty log = stack not applied honestly.
- Run record appended. ~/.rusty-data/dashboard-qa-history.jsonl has a new entry after every completed run.
Phase 4: Self-Improving Loop
Pattern: Skill Crystallization (Pattern 2) + Metric Ratchet (Pattern 3) from skills/self-improving-agent-patterns.
Invariant: Improvements to the skill are locked: checks are never removed unless the full corpus shows zero signal (≥ 5 runs, 0 occurrences). New checks are added only when a pattern appears in ≥ 2 runs.
Strength: Skill compounds across projects and time; issue classes discovered in run N become systematic checks in run N+1 rather than rediscoveries.
Weakness: The corpus is local; the crystallization threshold (≥ 2 runs) is a heuristic, not a statistically grounded cutoff.
Step 1: Record This Run
After closing all issues, append a run record:
{
"date": "2026-04-22T14:30:00Z",
"app_url": "http://localhost:3001",
"run_version": "260422b",
"issues_found": 9,
"issues_closed": 7,
"issues_open": 2,
"issue_breakdown": {
"data": 3,
"api": 1,
"scoring": 2,
"design": 2,
"db": 1,
"responsive": 0
},
"new_issue_types_seen": ["label_overlap", "dual_axis_missing_legend"],
"new_weaknesses_discovered": ["tooltip_content_not_verified"],
"checks_that_found_nothing": ["double_submit_guard", "touch_target_size"],
"notes": "latestSessionMqi coupling bug found via AI-eng lens; weighted-sum verified to 5dp"
}
# Append run record
echo '{...}' >> ~/.rusty-data/dashboard-qa-history.jsonl
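When the run record is assembled in a script rather than typed by hand, the append can be done from Node so the record is always serialized onto exactly one line. A minimal sketch using only Node built-ins; appendRunRecord and readCorpus are hypothetical helper names, and the file argument exists so the helpers can be pointed at a scratch file.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Default location taken from the skill; pass another path for dry runs.
const DEFAULT_HISTORY = path.join(os.homedir(), '.rusty-data', 'dashboard-qa-history.jsonl');

// Appends one run record as a single JSON line. JSON.stringify without an
// indent argument never emits raw newlines, which is what JSONL requires.
function appendRunRecord(record, file = DEFAULT_HISTORY) {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.appendFileSync(file, JSON.stringify(record) + '\n', 'utf8');
}

// Reads the whole corpus back as an array of records, skipping blank lines.
function readCorpus(file = DEFAULT_HISTORY) {
  if (!fs.existsSync(file)) return [];
  return fs.readFileSync(file, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map(line => JSON.parse(line));
}
```

Appending instead of rewriting the file keeps earlier runs untouched, which matters because the corpus is the evidence base for every later skill proposal.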
Step 2: Read the Corpus
Before proposing any skill update, read the entire history:
cat ~/.rusty-data/dashboard-qa-history.jsonl | jq -s '.'
# Count occurrences of each issue type across all runs
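# runs = number of runs with at least one issue of that type (zero counts are excluded)
cat ~/.rusty-data/dashboard-qa-history.jsonl | jq -s '[.[].issue_breakdown | to_entries[]] | group_by(.key) | map({type: .[0].key, total: (map(.value) | add), runs: (map(select(.value > 0)) | length)})'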
Analyze:
- Which issue types appear in ≥ 2 runs? → Candidates for systematic promotion to Phase 1 checklist
- Which checks found nothing in ≥ 5 runs? → Candidates for retirement or deprioritization
- Which new weaknesses appeared in ≥ 2 runs? → Candidates for addition to Weaknesses section
- Which fix patterns recurred? → Candidates for codification in the Fix Discipline section
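The first two questions above can be answered mechanically from the parsed history. The sketch below assumes run records shaped like the Step 1 example and, like the jq query, treats the issue_breakdown keys as issue types; analyzeCorpus and its return fields are hypothetical names.

```javascript
// runs: array of run records shaped like the example in Step 1.
function analyzeCorpus(runs, promoteAt = 2, retireAt = 5) {
  const seenIn = {};  // issue type -> number of runs with at least one such issue
  const quietIn = {}; // check name -> number of runs where it found nothing
  for (const run of runs) {
    for (const [type, n] of Object.entries(run.issue_breakdown || {}))
      if (n > 0) seenIn[type] = (seenIn[type] || 0) + 1;
    for (const check of run.checks_that_found_nothing || [])
      quietIn[check] = (quietIn[check] || 0) + 1;
  }
  const atLeast = (counts, min) =>
    Object.keys(counts).filter(k => counts[k] >= min).sort();
  return {
    promoteCandidates: atLeast(seenIn, promoteAt),  // seen in >= 2 runs
    retireCandidates: atLeast(quietIn, retireAt),   // quiet in >= 5 runs
  };
}
```

The output is a list of candidates only; whether a candidate becomes a proposal is still decided in Step 3.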
Step 3: Propose Skill Updates (GEPA Protocol)
G: Generate: Run the QA loop (Phases 0–3).
E: Evaluate: Compare this run against corpus. Answer: what did this run discover that wasn’t in the skill? What did the skill predict correctly?
P: Propose: Write a concrete update proposal. Each proposal has:
- type: add_check | retire_check | add_weakness | update_fix_discipline | update_phase
- evidence: which runs support this (list of dates + brief description)
- change: the exact text to add or remove
- threshold_met: true if the evidence spans ≥ 2 runs
A: Accept: Do NOT apply the update automatically. Present proposals to the user. Accept only proposals where threshold_met: true. After acceptance, apply as a Changelog entry and update the skill file.
Example proposal (threshold met):
type: add_check
evidence: runs 2026-04-22 and 2026-04-29 both found label_overlap in chart annotations
change: Add to D3 Typography checks: "Chart annotation labels must not overlap.
Verify with getBoundingClientRect() on all .recharts-label elements."
threshold_met: true
Example proposal (threshold NOT met):
type: add_check
evidence: only 2026-04-22 found dual_axis_missing_legend
change: Add dual-axis legend requirement to D2
threshold_met: false ← present to user but do not accept
Step 4: Monotonic Invariant
The skill may only improve, never degrade:
- Never remove a check unless: (a) ≥ 5 runs with zero signal AND (b) user explicitly authorizes removal.
- Never weaken a threshold (e.g., changing “≥ 44px” to “≥ 32px”) unless the original threshold is demonstrably wrong.
- Lock all accepted improvements. Once a check is added, it becomes part of all future runs.
The monotonic invariant prevents the common failure mode where a skill gets progressively simplified until it’s useless. Every added check should make future QA runs more thorough, not just longer.
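The removal rule can be enforced with a guard that compares the current check list against a proposed revision. This is a sketch under assumptions: illegalRemovals is a hypothetical name, and the stats shape ({ runs, occurrences } per check) is invented for illustration.

```javascript
// Returns the checks a proposed revision would drop without meeting both
// removal conditions: (a) >= 5 runs with zero signal, (b) explicit user authorization.
function illegalRemovals(currentChecks, proposedChecks, stats, authorizedByUser = []) {
  const kept = new Set(proposedChecks);
  return currentChecks
    .filter(name => !kept.has(name))
    .filter(name => {
      const s = stats[name] || { runs: 0, occurrences: 0 };
      const zeroSignal = s.runs >= 5 && s.occurrences === 0; // condition (a)
      const authorized = authorizedByUser.includes(name);    // condition (b)
      return !(zeroSignal && authorized);
    });
}
```

An empty result means the revision respects the invariant; any returned name should block the update until the missing condition is met.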
Provenance
Initial creation 2026-04-22: seven MQI v3 scoring bugs found and fixed via the DS/DE/AI-eng three-lens approach in rusty-bloomnet. The latestSessionMqi || currentMqi coupling bug (model-quality.js
Upgrade 2026-04-22b: Design Quality Lens draws from topics/wiki-image-eval-rubric 12-dimension rubric (developed 2026-04-03 for vault wiki visualization scoring). DB/cache mutation verification protocol drawn from a pattern observed across multiple jobs-apply and SKL audit sessions where buttons returned 200 but silently failed to write. Self-improving loop uses Pattern 2 (Skill Crystallization) + Pattern 3 (Metric Ratchet) from skills/self-improving-agent-patterns, adapted for QA skill evolution rather than metric optimization.
Ralph loop adaptation: the original definitions/ralph-loop iterates over spec files to build new components. Here, it iterates over issues to fix existing ones. The invariants carry over: single-issue scoping (one issue per loop, no scope creep), error tolerance (a failed fix does not abort the catalog), throwaway hypothesis (the first hypothesis is usually wrong; commit only after the interaction test passes).
Cross-Pattern Context
- skills/karpathy-ratchet: use when a scoring metric plateaus; the ratchet optimizes a tunable config, the QA loop fixes correctness bugs. Different targets, same measurement discipline.
- skills/self-improving-agent-patterns: Pattern 2 (Skill Crystallization) and Pattern 3 (Metric Ratchet) underpin Phase 4 of this skill.
- definitions/ralph-loop: the original autonomous build loop adapted here for issue-driven QA iteration.
- skills/debug-capture-system: for capturing console/screenshot evidence during the visual check phase.
- topics/wiki-image-eval-rubric: 12-dimension rubric adapted for the Design Quality Lens (Phase 1d).
- topics/visual-output-routing: routing framework when the issue involves chart type mismatch (R viz vs Figma).
- topics/brand-token-chart-families: brand token system; charts should draw from brand$ tokens, not hardcoded colors.
Visual Enrichment
| Finding Type | Tool | Specification |
|---|---|---|
| Issue catalog by severity + category | R viz (skills/r-visualization-pipeline) | Family: CMP grouped bar, Template: Journal |
| Issue close rate across runs (ratchet) | R viz (skills/r-visualization-pipeline) | Family: TS multiline, Template: Journal |
| Verification stack coverage map | Figma MCP (generate_diagram) | Type: Matrix/flowchart: issues × layers |
| Corpus issue frequency heatmap | R viz (skills/r-visualization-pipeline) | Family: COR heatmap, Template: Journal |