Pitfall Memory: peon-notify

Hook Pipeline Data Integrity

pitfall · hooks · data-engineering · OPERATIONAL

What Happened

A full audit of the Claude Code hook infrastructure revealed a chain of four data integrity failures across three pipelines:

  1. Failure manifest ANSI poisoning: failure-manifest-append.sh captured raw terminal output and embedded its ANSI escape codes (\x1b[0;31m) unescaped in JSON string values. The JSON spec (RFC 8259) requires control characters U+0000 through U+001F to be escaped inside strings, and ESC is U+001B. 14 entries were malformed, which blocked the entire flush pipeline (49 entries stuck unflushed).

  2. Failure manifest line concatenation: Some entries were written without a newline separator, leaving two JSON objects on one line. The JSON decoder then reported “extra data” after the first object’s closing brace.

  3. Public Lab build silently broken: 52 definition files synced from the vault lacked the required category field, so Astro’s Zod schema validation failed on every build. In addition, 12 experiment files set optional string fields to YAML null. Zod’s .optional() accepts a missing key (undefined), not a null value.

  4. Dead project routing: failure-flush.sh routed peon-notify entries to the archived PeonNotify directory instead of rusty-bloomnet. 12+ task files landed in a dead path.
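Both manifest failures reproduce with any strict JSON decoder. The minimal sketch below uses Python's json module, which is an assumption because the note does not say which decoder the flush pipeline uses. Its "Extra data" message matches the error quoted above. The sample lines are illustrative, not real manifest entries.

```python
import json

# A manifest line built by interpolating raw terminal output: the ESC byte
# (0x1B) from an ANSI color code sits unescaped inside a JSON string.
poisoned = '{"msg": "\x1b[0;31mFAIL\x1b[0m"}'

# Two records written without a newline separator between them.
concatenated = '{"id": 1}{"id": 2}'

def decode_error(line: str) -> str:
    """Return the decoder's error message for a line, or 'ok' if it parses."""
    try:
        json.loads(line)  # strict=True by default: raw control chars rejected
        return "ok"
    except json.JSONDecodeError as exc:
        return exc.msg

print(decode_error(poisoned))      # reports an invalid control character
print(decode_error(concatenated))  # -> Extra data
```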

Root Cause

Pipeline boundary hygiene. Each failure is a boundary violation:

  • Shell output (ANSI codes) leaked into a JSON data store
  • YAML null semantics leaked into a Zod string schema
  • A project rename (PeonNotify to stella/rusty-bloomnet) wasn’t propagated to the routing table

All three share a pattern: data crossed a boundary without sanitization or validation at the boundary.
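The Python below sketches validation at the boundary for the frontmatter case. It mimics the two Zod rules that broke the Public Lab build: a required field must be present, and an optional string field may be absent but not null. The SCHEMA table and the field names title and summary are hypothetical; only category comes from the note. The real schema lives in the Astro content config, which is not shown here.

```python
from typing import Any

# Hypothetical schema: field name -> required?
# True mirrors z.string(); False mirrors z.string().optional().
SCHEMA = {"title": True, "category": True, "summary": False}

def validate(frontmatter: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry is valid."""
    problems = []
    for field, required in SCHEMA.items():
        if field not in frontmatter:
            # Absent key: allowed for optional fields (JS would see undefined).
            if required:
                problems.append(f"{field}: required field missing")
        elif frontmatter[field] is None:
            # YAML `summary:` or `summary: null` parses to null, which
            # .optional() rejects: it allows undefined, not null.
            problems.append(f"{field}: null is not a string")
        elif not isinstance(frontmatter[field], str):
            problems.append(f"{field}: expected a string")
    return problems

print(validate({"title": "A", "category": "ml"}))                   # []
print(validate({"title": "A", "category": "ml", "summary": None}))  # ['summary: null is not a string']
print(validate({"title": "A"}))                                     # ['category: required field missing']
```

The remedy for the null case is the one applied in the fix below: delete the key so it reads as absent.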

Fix Applied

  1. failure-manifest-append.sh: Strip ANSI codes (sed $'s/\x1b\[[0-9;]*[a-zA-Z]//g') and control chars before JSON serialization
  2. Manifest repair script: walks each line with a JSON decoder, splitting concatenated objects and stripping control chars
  3. Public Lab: Added category field to all 52 definitions via auto-classification from tags
  4. Experiment frontmatter: Removed the YAML null lines (an absent key reads as undefined, which is exactly what .optional() accepts)
  5. failure-flush.sh routing was already updated for stella migration (verified: peon-notify|peonnotify|peon maps to rusty-bloomnet)
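The repair step can be sketched in Python under a few assumptions: the manifest is newline-delimited JSON, and the ANSI pattern is reused from the sed fix. The names clean and repair are hypothetical, not the actual script. A line that still fails to decode raises json.JSONDecodeError instead of being silently dropped.

```python
import json
import re

# Same pattern as the sed fix: ESC, '[', digits/semicolons, then a letter.
# This matches CSI sequences such as \x1b[0;31m and \x1b[0m.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")
# Any remaining C0 control character (U+0000 through U+001F).
CONTROL_RE = re.compile(r"[\x00-\x1f]")

def clean(line: str) -> str:
    """Strip ANSI sequences first (else '[0;31m' residue remains), then
    stray control characters. Raw newlines and tabs go too; escaped ones
    inside JSON strings (a backslash then 'n' or 't') are plain text."""
    return CONTROL_RE.sub("", ANSI_RE.sub("", line))

def repair(lines):
    """Yield every JSON object found in lines, splitting concatenated ones."""
    decoder = json.JSONDecoder()
    for raw in lines:
        text = clean(raw)
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                break
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

broken = [
    '{"id": 1, "msg": "\x1b[0;31mFAIL\x1b[0m"}\n',  # ANSI-poisoned entry
    '{"id": 2}{"id": 3}\n',                          # two records fused on one line
]
records = list(repair(broken))
print(records)  # [{'id': 1, 'msg': 'FAIL'}, {'id': 2}, {'id': 3}]
```

Writing the repaired records back with one json.dumps call per line keeps the file newline-delimited. json.dumps also escapes control characters on output instead of emitting them raw.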

Additional Findings

  • failure-detect-e2e.sh had an unbound $jobs variable under set -u (declare -A alone does not assign the array, and bash versions differ on whether an unassigned array counts as unbound)
  • 93 codeguard state directories had accumulated (51 stale ones cleaned; 42 remain from active sessions)
  • A temporary diagnostic script (/tmp/stella-diag.sh) was still referenced in settings.local.json (removed)
  • Watchdog log showed “No space left on device” errors from a past disk pressure incident
  • The job behind obsidian-cron.log uses GNU head -z, an option the BSD head shipped with macOS doesn’t support

Redemption

49 stuck failure-manifest entries flushed successfully after repair. Public Lab builds passing (192 pages, 3.12s). All hooks tested.

Key Lesson

Validate at every data boundary. Shell output is not JSON-safe. YAML null is not JavaScript undefined. A project rename is a schema migration. The accumulate-then-flush pattern works brilliantly when the accumulator produces valid records; when it doesn’t, the entire pipeline silently stops.
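One way to keep the accumulator honest is to validate on append rather than on flush. The Python sketch below assumes a newline-delimited manifest and two hypothetical required fields (project, message). The real failure-manifest-append.sh is a shell script, so this only illustrates the contract. Serializing through a JSON encoder escapes ESC as \u001b rather than stripping it, so it complements the sed fix rather than replacing it.

```python
import json
import os
import tempfile

# Hypothetical required fields for a manifest record.
REQUIRED = ("project", "message")

def append_record(path: str, record: dict) -> None:
    """Validate one record, then append it as exactly one JSON line."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # json.dumps escapes control characters (ESC becomes \u001b), so raw shell
    # output inside a value cannot produce an invalid line. allow_nan=False
    # rejects NaN/Infinity, which strict JSON does not allow.
    line = json.dumps(record, ensure_ascii=True, allow_nan=False)
    # Always newline-terminate so consecutive records never fuse onto one line.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")

with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "failure-manifest.jsonl")  # hypothetical name
    append_record(manifest, {"project": "peon-notify", "message": "\x1b[0;31mFAIL\x1b[0m"})
    append_record(manifest, {"project": "peon-notify", "message": "ok"})
    with open(manifest, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(len(lines))  # 2
print(all(json.loads(line)["project"] == "peon-notify" for line in lines))  # True
```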