Journal

Guardrail effectiveness audit ships: 93% blind spots revealed

guardrails · audit · data-engineering · interrupted-time-series

Signal

Multi-layer guardrail architectures (hook + memory-rule) achieve 100% failure reduction where tested. But we’re only measuring 7% of our guardrail investment: 140 of 150 failure modes have zero observability data. The audit pipeline itself is the most valuable deliverable: it makes the blind spot visible and measurable.
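The 7% and 93% figures follow directly from the counts above. A minimal sketch of that arithmetic, where the function and key names are illustrative and not taken from the pipeline:

```python
def coverage(total_modes: int, modes_without_data: int) -> dict:
    """Summarise how many failure modes have any observability data."""
    if total_modes <= 0:
        raise ValueError("total_modes must be positive")
    observed = total_modes - modes_without_data
    return {
        "observed": observed,
        "observed_pct": round(100 * observed / total_modes),
        "blind_pct": round(100 * modes_without_data / total_modes),
    }


# Counts from this entry: 150 failure modes, 140 with zero data.
summary = coverage(total_modes=150, modes_without_data=140)
print(summary)  # {'observed': 10, 'observed_pct': 7, 'blind_pct': 93}
```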

Evidence

  • Full ITS pipeline: 5 stages, 771 sessions, 150 failure modes, 123 interventions, 131 incidents
  • hook-hard-block + memory-rule = 100% reduction (Gmail MCP: 5 occurrences → 0, with 4 blocked attempts proving load-bearing)
  • memory-rule alone = 33% resolution rate (1/3). Necessary but insufficient for enforcement
  • 12 always-on hooks fire on every tool call (latency budget unknown: timing data not yet parsed)
  • 1 pruning candidate: ralph-loop guard never fired in 470+ sessions
  • Smoke detector paradox: guard-db-destruction shows its incident rate rising from 0 to 0.05 because it surfaced a latent failure, not because it caused one
  • Pipeline at ~/vault/guardrails/_analysis/ (7 scripts, 24 tests, full e2e)
  • Dashboard: dashboards/guardrail-effectiveness
  • 150 individual study entries: guardrails/_index
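The before/after comparisons in the list above can be sketched as a simple rate calculation. This is a simplified illustration, not the pipeline's actual code: the function names are hypothetical, and the per-window session counts (100 before, 100 after) are placeholders, since this entry reports incident counts and rates but not window sizes.

```python
from typing import Optional


def incident_rate(incidents: int, sessions: int) -> float:
    """Incidents per session within one observation window."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return incidents / sessions


def relative_reduction(pre_rate: float, post_rate: float) -> Optional[float]:
    """Fraction of the pre-intervention rate that disappeared afterwards.

    Returns None when the pre-intervention rate is zero: there is no
    baseline to reduce, which is the smoke detector paradox case.
    """
    if pre_rate == 0:
        return None
    return (pre_rate - post_rate) / pre_rate


# Gmail MCP: 5 occurrences before, 0 after (session counts are placeholders).
gmail = relative_reduction(incident_rate(5, 100), incident_rate(0, 100))

# guard-db-destruction: rate went from 0 to 0.05, so no reduction is defined.
smoke = relative_reduction(0.0, 0.05)

print(gmail, smoke)
```

Returning None instead of a negative number for the zero-baseline case keeps a newly visible failure from being scored as a guard that made things worse.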

So What

The guardrail system works where it’s instrumented, but it’s instrumented almost nowhere. We’ve been investing in guards based on recency bias (whatever burned us last) rather than coverage analysis. The 93% no-data rate means our “guardrail architecture” is mostly vibes.

The architecture pattern finding is actionable: promote hook-hard-block + memory-rule as the default pattern for critical/high severity modes. Memory rules are free to write but don’t guarantee enforcement (1 of 3 resolved on their own). The marginal cost of adding a hook script is low, and its marginal value is high: the hook is what raises the enforcement ceiling to the 100% reduction seen in the tested case.
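As a rough illustration of why the hook layer matters, here is a sketch of a hard-block guard that refuses matching tool calls and records each blocked attempt. The hook interface is not described in this entry, so the class, its fields, and the tool names (`gmail_send`, `read_file`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HardBlockGuard:
    """Hypothetical hard-block hook: deny by tool-name prefix, log attempts."""

    name: str
    blocked_tool_prefixes: Tuple[str, ...]
    blocked_attempts: List[str] = field(default_factory=list)

    def check(self, tool_name: str) -> bool:
        """Return True if the call may proceed, False if it is blocked."""
        if tool_name.startswith(self.blocked_tool_prefixes):
            # Recording the attempt is what later shows the guard is load-bearing.
            self.blocked_attempts.append(tool_name)
            return False
        return True


guard = HardBlockGuard(name="example-hard-block", blocked_tool_prefixes=("gmail_",))
allowed = guard.check("read_file")   # not matched, proceeds
blocked = guard.check("gmail_send")  # matched, blocked and logged
print(allowed, blocked, guard.blocked_attempts)
```

A memory rule only asks the agent not to make the call; a guard of this shape refuses the call regardless and leaves a countable trace when it does.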

What’s Next

  1. Instrument the 7 open failure modes (add session-tagging to confirm rates)
  2. Parse hook timing data for latency budget (12 always-on hooks is a lot)
  3. Deploy hook-hard-block for remaining open/critical modes (6 candidates)
  4. Weekly automated re-run to track drift
  5. Prune ralph-loop guard after 90 more zero-incident sessions

Log

  • Built 5-stage ITS pipeline from scratch (session census → failure modes → interventions → incidents → effectiveness)
  • Generated 150 vault study entries + aggregate dashboard
  • Discovered smoke detector paradox (db-destruction guard)
  • Identified 1 pruning candidate, 7 open gaps, 3 resolved modes
  • Full experiment entry: experiments/bloomnet/2026-05-05-guardrail-effectiveness-audit

Blockers

  • Hook timing data not yet available (needs bloomnet-hooks.jsonl parsing)
  • 32 interventions unmapped to failure modes (style/process rules without clear failure-mode match)
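One way the hook timing blocker might be approached, once the log format is confirmed: read the JSONL file line by line and summarise latency per hook. The schema of bloomnet-hooks.jsonl is not known here, so the field names `hook` and `duration_ms` and the sample records below are assumptions for illustration only.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable


def hook_latency_summary(lines: Iterable[str]) -> Dict[str, dict]:
    """Group JSONL timing records by hook and report call count and median."""
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # Assumed field names; adjust once the real schema is known.
        durations[record["hook"]].append(float(record["duration_ms"]))
    return {
        hook: {"calls": len(values), "median_ms": median(values)}
        for hook, values in durations.items()
    }


# Made-up sample records, not real measurements.
sample = [
    '{"hook": "example-hook-a", "duration_ms": 4}',
    '{"hook": "example-hook-a", "duration_ms": 6}',
    '{"hook": "example-hook-b", "duration_ms": 10}',
]
latency = hook_latency_summary(sample)
print(latency)
```

Against the real log, the same function could be fed an open file handle, since a text file iterates line by line.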

Notes

This is the first time we’ve treated guardrails as a data problem rather than a design problem. The ITS methodology, borrowed from epidemiology (interrupted time series for policy evaluation), adapts cleanly to agentic failure prevention. The pipeline is idempotent and re-runnable, so it should become a weekly cron job.