Guardrail effectiveness audit ships: 93% blind spots revealed
Signal
Multi-layer guardrail architectures (hook + memory-rule) achieve 100% failure reduction where tested. But we’re measuring only 7% of our guardrail investment: 10 of 150 failure modes have observability data, and the other 140 have none. The audit pipeline itself is the most valuable deliverable because it makes the blind spot visible and measurable.
Evidence
- Full ITS (interrupted time series) pipeline: 5 stages, 771 sessions, 150 failure modes, 123 interventions, 131 incidents
- hook-hard-block + memory-rule = 100% reduction (Gmail MCP: 5 occurrences → 0, with 4 blocked attempts proving load-bearing)
- memory-rule alone = 33% resolution rate (1/3). Necessary but insufficient for enforcement
- 12 always-on hooks fire on every tool call (latency budget unknown: timing data not yet parsed)
- 1 pruning candidate: ralph-loop guard never fired in 470+ sessions
- Smoke detector paradox: guard-db-destruction shows 0→0.05 rate because it surfaced a latent failure, not caused one
- Pipeline at ~/vault/guardrails/_analysis/ (7 scripts, 24 tests, full e2e)
- Dashboard: dashboards/guardrail-effectiveness
- 150 individual study entries: guardrails/_index
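The headline percentages follow directly from the counts above. A minimal sketch of that arithmetic, assuming simple occurrence counts per failure mode; the function names are hypothetical and are not taken from the pipeline scripts:

```python
def coverage(observed_modes: int, total_modes: int) -> float:
    """Fraction of failure modes that have any observability data."""
    return observed_modes / total_modes


def reduction(before: int, after: int) -> float:
    """Relative drop in occurrences after an intervention."""
    if before == 0:
        # A 0 -> positive change (the smoke detector case) has no
        # meaningful "reduction"; flag it instead of dividing by zero.
        raise ValueError("no pre-intervention occurrences; reduction is undefined")
    return (before - after) / before


if __name__ == "__main__":
    observed = 150 - 140  # modes with data = total minus no-data modes
    print(f"coverage: {coverage(observed, 150):.0%}")
    print(f"blind spot: {1 - coverage(observed, 150):.0%}")
    print(f"Gmail MCP reduction: {reduction(5, 0):.0%}")
```

The explicit error for a zero baseline is a design choice for this sketch: it keeps cases like guard-db-destruction out of the reduction column, so they get read separately.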
So What
The guardrail system works where it’s instrumented, but it’s instrumented almost nowhere. We’ve been investing in guards based on recency bias (whatever burned us last) rather than coverage analysis. The 93% no-data rate means our “guardrail architecture” is mostly vibes.
The architecture pattern finding is actionable: promote hook-hard-block + memory-rule as the default pattern for critical/high severity modes. Memory rules are free to write but don’t guarantee enforcement: memory-rule alone resolved 1 of 3 modes, while the paired pattern took Gmail MCP from 5 occurrences to 0. The marginal cost of adding a hook script is low, and the hook is what turns a rule into enforcement.
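The pairing can be sketched as two layers: a hook that refuses the call, and a memory rule that tells the agent why. This is an illustration only. The tool name, the rule text, and the hook signature are all hypothetical and do not reflect the actual hook interface:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allow: bool
    reason: str


# Hypothetical deny list; the real guards are not described in this entry.
BLOCKED_TOOLS = frozenset({"example_dangerous_tool"})

# Memory-rule layer: advisory text the agent reads. It enforces nothing.
MEMORY_RULE = "Never call example_dangerous_tool; ask the user instead."

# Blocked attempts are counted so the audit can tell whether the hook is
# load-bearing (attempts still happen) or idle (a pruning candidate).
blocked_attempts = Counter()


def hard_block_hook(tool_name: str) -> Decision:
    """Enforcement layer: runs before the tool call and can refuse it."""
    if tool_name in BLOCKED_TOOLS:
        blocked_attempts[tool_name] += 1
        return Decision(False, f"blocked by guardrail: {MEMORY_RULE}")
    return Decision(True, "ok")
```

Returning the memory rule text in the block reason is one way to keep the two layers in sync, so a refused call restates the rule it enforces.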
What’s Next
- Instrument the 7 open failure modes (add session-tagging to confirm rates)
- Parse hook timing data for latency budget (12 always-on hooks is a lot)
- Deploy hook-hard-block for remaining open/critical modes (6 candidates)
- Weekly automated re-run to track drift
- Prune ralph-loop guard after 90 more zero-incident sessions
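The pruning rule in the last item can be made mechanical. A minimal sketch, assuming a chronological list of per-session firing counts for one guard is available; the function name and threshold parameter are hypothetical, and "zero-incident" is read here as "the guard did not fire", which is an assumption:

```python
def should_prune(fires_per_session, quiet_sessions_required):
    """Return True when the guard stayed silent for the last N sessions."""
    if quiet_sessions_required <= 0:
        raise ValueError("quiet_sessions_required must be positive")
    if len(fires_per_session) < quiet_sessions_required:
        return False  # not enough history to decide yet
    recent = fires_per_session[-quiet_sessions_required:]
    return all(count == 0 for count in recent)
```

With the numbers above, a threshold of 560 (470 observed plus 90 more) would be one reading of the rule: `should_prune(history, 560)`.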
Log
- Built 5-stage ITS pipeline from scratch (session census → failure modes → interventions → incidents → effectiveness)
- Generated 150 vault study entries + aggregate dashboard
- Discovered smoke detector paradox (db-destruction guard)
- Identified 1 pruning candidate, 7 open gaps, 3 resolved modes
- Full experiment entry: experiments/bloomnet/2026-05-05-guardrail-effectiveness-audit
Blockers
- Hook timing data not yet available (needs bloomnet-hooks.jsonl parsing)
- 32 interventions unmapped to failure modes (style/process rules without clear failure-mode match)
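The timing blocker comes down to a per-hook summary of a JSONL log. A minimal sketch follows. The schema of bloomnet-hooks.jsonl is not described here, so the `hook` and `duration_ms` field names are assumptions, and the sample records are synthetic, not measurements:

```python
import json
from collections import defaultdict
from statistics import median


def summarize_hook_timings(lines):
    """Group durations by hook name; report count, median, and max in ms."""
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # Assumed field names; adjust to the real log schema.
        durations[record["hook"]].append(float(record["duration_ms"]))
    return {
        hook: {"n": len(v), "median_ms": median(v), "max_ms": max(v)}
        for hook, v in durations.items()
    }


# Synthetic sample for illustration only.
sample = [
    '{"hook": "guard-a", "duration_ms": 4}',
    '{"hook": "guard-a", "duration_ms": 6}',
    '{"hook": "guard-b", "duration_ms": 10}',
]
summary = summarize_hook_timings(sample)
# Always-on hooks all run on every tool call, so their medians add up.
per_call_budget_ms = sum(s["median_ms"] for s in summary.values())
```

Summing medians gives a rough typical-case cost per tool call. It assumes the hooks run sequentially, which would need confirming.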
Notes
This is the first time we’ve treated guardrails as a data problem rather than a design problem. The ITS methodology, borrowed from epidemiology (interrupted time series for policy evaluation), adapts cleanly to agentic failure prevention. The pipeline is idempotent and re-runnable, so it should become a weekly cron job.
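The core ITS comparison can be sketched as a before/after rate around the session where a guard landed. This is a deliberate simplification, not the pipeline's implementation: it compares mean incident rates only, whereas a full interrupted time series analysis would also model the pre-intervention trend. The names and sample data are hypothetical:

```python
import math


def mean_rate(counts):
    """Incidents per session; NaN when there are no sessions to average."""
    return sum(counts) / len(counts) if counts else math.nan


def pre_post_rates(incidents_per_session, intervention_index):
    """Split a chronological series at the intervention and compare rates."""
    pre = incidents_per_session[:intervention_index]
    post = incidents_per_session[intervention_index:]
    return mean_rate(pre), mean_rate(post)


if __name__ == "__main__":
    synthetic = [1, 0, 1, 0, 0, 0, 0, 0]  # illustrative, not audit data
    print(pre_post_rates(synthetic, 4))  # (0.5, 0.0)
```

A post rate above the pre rate, as with guard-db-destruction, needs the detection-versus-causation reading before it counts against the guard.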