Journal

Stale API key looks like a code bug: verify credentials before touching code

career, data-eng, ai-agents, ops

2026-03-28

Signal

The OpenRouter key incident, in which a stale key in ~/.zshrc caused a full pipeline outage that looked like a code bug, established a universal debugging principle: on any 401 error, verify the credential first, before touching the code.

Evidence

  • Project: projects/jobs-apply/_index: Database Audit v3.1: Neon PostgreSQL provisioned, 25 tasks across 8 phases (PII stripping, API locking, field encryption, GDPR, indexes, retention, caching, rate limiting); Schema v3 deployed; auth fallback (userId=1) removed
  • Project: projects/jobs-apply/_index: SaaS Launch Prep: all code complete, 592 tests passing, 10/10 packages build; remaining: 5 service env var groups on Vercel
  • Incident: OpenRouter key in ~/.zshrc was stale: full pipeline outage looked like a code bug; root cause: keys scrubbed from config during SaaS prep but ~/.zshrc not updated; rule established: on 401, verify key on OpenRouter dashboard FIRST
  • Project: internal audit: Karpathy Ratchet baseline: evaluator rewritten from tautological (Pearson on same scores) to 6 objective verifiable properties; DQI baseline 0.9049
  • Project: projects/peon-notify/_index: Memory watchdog with per-process/aggregate/system thresholds; auto-restart via peon-claude wrapper (max 5 retries); signal forwarding fix; Obsidian integration development
  • Project: projects/bloomnet/_index: Initial build (170 code-review sessions indicate major initial development of the React developer analytics dashboard)

So What (Why Should You Care)

The OpenRouter key incident illustrates a failure mode specific to systems with multiple credential sources. The config file was cleaned up during SaaS security prep, with hardcoded secrets scrubbed as required, but ~/.zshrc still had the old key. The symptom (a 401 from the API) looked exactly like a code bug, yet the code was not the problem: the key was stale. The investigation started with the code, which was correct, and spent time debugging something that didn’t need fixing.
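One way to catch this kind of drift is to compare the same credential across every place it can be loaded from. The sketch below is illustrative only: the source names and key values are hypothetical, and it compares short hashes so that no key is ever printed.

```python
import hashlib
from typing import Dict, List, Optional


def fingerprint(secret: str) -> str:
    """Short SHA-256 prefix, so keys can be compared without exposing them."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def group_sources_by_key(sources: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Group credential sources by the fingerprint of the key they hold.

    More than one group means the sources disagree, so at least one is stale.
    Sources with no key set are grouped under "<unset>".
    """
    groups: Dict[str, List[str]] = {}
    for name, value in sources.items():
        label = fingerprint(value) if value else "<unset>"
        groups.setdefault(label, []).append(name)
    return groups


# Hypothetical example values; real ones would come from the shell
# environment, the project config, and wherever else the key is stored.
example_sources = {
    "shell env (~/.zshrc)": "sk-old-example",
    "project config": None,
    "hosting env vars": "sk-new-example",
}

for label, names in group_sources_by_key(example_sources).items():
    print(label, "<-", ", ".join(names))
```

If the output shows more than one fingerprint, the sources disagree and the next step is to find out which one the failing process actually reads.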

The rule established from this incident (when you see a 401, verify the credential on the OpenRouter dashboard before debugging code) is cheap to follow. It takes 30 seconds. The alternative is hours of code investigation that leads nowhere because the code is fine. Any time a working system suddenly fails with an auth error, the most likely cause is a credential change somewhere in the environment, not a regression in code that was previously working.
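The ordering in this rule can be written down as a tiny triage function. This is a minimal sketch, not project code: `credential_is_valid` is a hypothetical callable standing in for whatever cheap check is available (looking at the dashboard by hand, or a lightweight authenticated request), and the returned step names are made up for illustration.

```python
from typing import Callable

AUTH_FAILURE_STATUS = 401


def next_debug_step(status_code: int, credential_is_valid: Callable[[], bool]) -> str:
    """Pick the next debugging step for a failed API call.

    On a 401 the cheap credential check runs before anything else;
    code is only suspected once the key is confirmed good.
    """
    if status_code == AUTH_FAILURE_STATUS:
        if not credential_is_valid():
            return "replace-credential"
        # Key is good, so look at how it is loaded and sent, not at business logic.
        return "check-key-loading"
    # Non-auth failures skip the credential check entirely.
    return "debug-code"
```

For example, `next_debug_step(401, lambda: False)` points at the credential without the code ever being opened, which is the 30-second path described above.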

The broader pattern applies beyond API keys. When a working system suddenly fails, an environmental change is more probable than a code bug, because code doesn’t change unless someone changes it, but environments change constantly: keys rotate, certificates expire, env vars get overwritten, system packages update. The diagnostic question “what changed recently in the environment?” is often more productive than “what changed recently in the code?”
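The question "what changed recently in the environment?" is easier to answer if a snapshot from a known-good moment exists. The following is a rough sketch under that assumption; the function names are hypothetical, and values are stored as hashes so a saved snapshot never contains secrets in clear text.

```python
import hashlib
from typing import Dict, List, Mapping


def snapshot(env: Mapping[str, str]) -> Dict[str, str]:
    """Map each variable name to a SHA-256 hash of its value."""
    return {
        name: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for name, value in env.items()
    }


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Report which variable names were added, removed, or changed."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

In practice one might call `snapshot(os.environ)` while the system is healthy, save the result, and diff it against a fresh snapshot when something breaks. A rotated or overwritten key then shows up under "changed" by name only.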

The DQI baseline of 0.9049 established for internal audit is the starting line for the Karpathy ratchet: every future improvement is measured against this number. The evaluator rewrite (from tautological Pearson correlation to 6 objective verifiable properties) is what made the baseline meaningful. A tautological baseline tells you nothing; an objective baseline tells you exactly where you start and how much you’ve improved.
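Why the old evaluator was tautological can be shown in a few lines: correlating a set of scores with itself always gives 1, whatever the scores are. The sketch below is not the actual evaluator or the DQI formula. `property_score` and its two example checks are hypothetical stand-ins for the general idea of scoring against independently verifiable properties.

```python
import math
from typing import Callable, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def property_score(record: dict, checks: List[Callable[[dict], bool]]) -> float:
    """Fraction of independently verifiable checks that a record passes."""
    return sum(1 for check in checks if check(record)) / len(checks)


# Tautology: any non-constant scores correlate perfectly with themselves,
# so the result says nothing about whether the scores are good.
low_quality_scores = [0.1, 0.3, 0.2, 0.15]
high_quality_scores = [0.9, 0.95, 0.99, 0.92]
print(pearson(low_quality_scores, low_quality_scores))
print(pearson(high_quality_scores, high_quality_scores))

# Hypothetical objective checks: each can fail independently of the others.
example_checks = [
    lambda r: bool(r.get("title")),
    lambda r: str(r.get("url", "")).startswith("https://"),
]
print(property_score({"title": "Example", "url": "ftp://example.invalid"}, example_checks))
```

Both Pearson calls land at 1 (up to floating-point rounding) even though the two score sets differ completely, while the property-based score moves only when a concrete check passes or fails. That difference is what lets a ratchet measure real improvement.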

The projects/bloomnet/_index initial build today (170 code-review sessions) reflects a system that went from concept to codebase in a single day. The high session count suggests a complete initial implementation pass rather than incremental development.

What’s Next

  • Complete Vercel env var setup for SaaS launch (5 groups remaining)
  • Run first Karpathy ratchet iteration from the 0.9049 DQI baseline

Log

  • projects/jobs-apply/_index: Database Audit v3.1
  • Neon PostgreSQL provisioned (5 Vercel env vars set)
  • 25 tasks across 8 phases: PII stripping, API locking, field encryption, GDPR, indexes, retention, caching, rate limiting
  • Schema v3 deployed to Neon + local SQLite
  • Auth fallback (userId=1) removed, hardcoded secrets scrubbed
  • SaaS Launch Prep: all code complete, 592 tests passing, 10/10 packages build
  • Remaining: 5 Vercel env var groups (Stripe, Google OAuth, SendGrid, Sentry, Blob)
  • OpenRouter key incident: stale key in ~/.zshrc caused full outage; looked like code bug
  • Root cause: keys scrubbed from config but ~/.zshrc not updated
  • Rule: on 401, verify key on OpenRouter dashboard FIRST before debugging code
  • internal audit: Karpathy Ratchet baseline
  • Evaluator rewritten from tautological (Pearson on same scores) to 6 objective verifiable properties
  • DQI baseline: 0.9049 (with --crawl --skip-url-check)
  • First ratchet run started
  • projects/peon-notify/_index: memory watchdog with per-process/aggregate/system thresholds
  • Auto-restart via peon-claude wrapper (max 5 retries)
  • Signal forwarding fix for wrapper process
  • Obsidian integration development (accumulate/flush architecture)
  • projects/bloomnet/_index: initial build: React dashboard for visualizing Claude Code usage patterns