
A post-deploy health check gate will catch silent deployment failures within one deploy cycle instead of the 10-30 commit cascades observed in the Cloudflare, Vercel, and Railway pitfalls

ops, deployment, ci-cd

Result: pending

Changelog

Date        Summary
2026-04-06  Audited: chain linked to cascade-attack (iteration 2, chain_prev set), last_audited stamped
2026-04-04  Initial creation

Hypothesis

We bet that the root cause of the 10-30 commit cascades documented in the Cloudflare, Vercel, and Railway pitfalls is a missing feedback loop: the deploy pipeline reports success while the live URL still serves stale content, and the only way to discover this is to check the site manually. Without an automated gate, the failure is detected at the “why is the feature not live?” stage, often 30 minutes later and 3 commits deep.

A post-deploy verification step that checks both platform status (wrangler pages deployments list, vercel ls) AND live URL health (HTTP 200 + version string from /api/health) creates the feedback loop at the right point: immediately after git push, before the next commit is written. The hypothesis is that this gate catches 100% of silent deploy failures within 2 minutes of deployment (the typical platform propagation window), reducing the debug-commit cascade from 10-30 commits to 1-2.
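
The live-URL half of the gate can be sketched as a small shell check. This is a minimal sketch, not the actual checker: it assumes the health endpoint returns a flat JSON object with a "version" field holding the git SHA, and the function names extract_version and health_ok are hypothetical. The version is parsed with sed rather than a JSON tool only to keep the sketch dependency-free.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a health response proves the expected commit is live.
# Assumption: the body is a flat JSON object containing "version": "<git-sha>".

# Read a JSON body on stdin and print the value of its "version" field.
# Only handles the flat single-object shape assumed above.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# health_ok <http-status> <json-body> <expected-sha>
# Succeeds only if the status is 200 AND the served version equals the
# expected SHA; a 200 with a stale version counts as a failed deploy.
health_ok() {
  local status="$1" body="$2" expected="$3" live
  [ "$status" = "200" ] || return 1
  live="$(printf '%s' "$body" | extract_version)"
  [ -n "$live" ] && [ "$live" = "$expected" ]
}

# Possible wiring (LIVE_URL is a placeholder, not a value from this note):
#   status="$(curl -s -o /tmp/health.json -w '%{http_code}' "$LIVE_URL/api/health")"
#   health_ok "$status" "$(cat /tmp/health.json)" "$(git rev-parse HEAD)"
```

Comparing the served version against the pushed SHA is what separates a silent failure from a real success: a stale deployment still answers with HTTP 200, so the status code alone would pass.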

Method

  1. Build scripts/deploy-verify.sh: platform-agnostic entry point that dispatches to platform-specific checkers
  2. Cloudflare checker: wrangler pages deployments list --project-name=<name> for deployment status, plus curl -s <live-url> against the version endpoint. The version endpoint returns a JSON object of the form {"version": "<git-sha>", "deployed": "<timestamp>"}
  3. Vercel checker: vercel ls for Ready status + curl -s <live-url>/api/health for version
  4. Railway checker: railway logs --latest for deploy status + health endpoint
  5. Retry logic: exponential backoff (10s, 30s, 90s; 130s of waiting in total) with at most 3 retries to handle platform propagation delay
  6. CI integration: add as post-deploy step in GitHub Actions workflows
  7. Metric: count deploy-related fix commits before/after gate introduction across redcorsair and jobs-apply
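
Steps 1 and 5 above could be sketched roughly as follows. This is an illustrative skeleton under stated assumptions, not the finished scripts/deploy-verify.sh: the names verify_deploy, retry_with_backoff, check_cloudflare, check_vercel, check_railway, and the BACKOFF_DELAYS variable are all hypothetical, and the checkers are left as failing placeholders where the platform CLI call and the live-URL check would go.

```shell
#!/usr/bin/env bash
# Sketch of a platform-agnostic entry point with retry and backoff.

# Delays between attempts, following the 10s/30s/90s schedule from step 5.
# Overridable through the environment so the logic can be exercised quickly.
BACKOFF_DELAYS="${BACKOFF_DELAYS:-10 30 90}"

# retry_with_backoff <command> [args...]
# Runs the command once, then once more after each delay, stopping at the
# first success. With three delays that is one attempt plus three retries.
retry_with_backoff() {
  "$@" && return 0
  for delay in $BACKOFF_DELAYS; do
    echo "check failed; retrying in ${delay}s" >&2
    sleep "$delay"
    "$@" && return 0
  done
  return 1
}

# Placeholder checkers (hypothetical). Each would combine the platform
# status command with the live-URL health check and return 0 on success.
check_cloudflare() { echo "cloudflare checker not implemented" >&2; return 1; }
check_vercel()     { echo "vercel checker not implemented" >&2; return 1; }
check_railway()    { echo "railway checker not implemented" >&2; return 1; }

# verify_deploy <platform>
# Dispatches to the matching checker; returns 2 for an unknown platform.
verify_deploy() {
  case "$1" in
    cloudflare) retry_with_backoff check_cloudflare ;;
    vercel)     retry_with_backoff check_vercel ;;
    railway)    retry_with_backoff check_railway ;;
    *) echo "unknown platform: $1" >&2; return 2 ;;
  esac
}
```

A CI step would then call something like verify_deploy vercel and fail the job on a non-zero exit status. Keeping the retry wrapper separate from the checkers means each platform checker only has to answer "is the expected version live right now?".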

Results

Pending. Will measure:

  • Number of deploy-related fix commits per deployment
  • Time-to-detection for deploy failures
  • False positive rate (gate fails when deploy actually succeeded)
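
The first metric could be approximated by counting commit subjects that look like deploy fixes. The sketch below is one possible way to do that, not the agreed measurement method: the function name count_deploy_fixes and the keyword pattern in DEPLOY_FIX_PATTERN are assumptions and would need tuning against the actual commit history.

```shell
#!/usr/bin/env bash
# Sketch: count commit subjects that look like deploy-related fixes.

# Assumed heuristic: the subject mentions "fix" followed by a deploy keyword.
DEPLOY_FIX_PATTERN="${DEPLOY_FIX_PATTERN:-fix.*(deploy|wrangler|vercel|railway|build)}"

# Reads commit subject lines on stdin and prints the number of matches.
# grep -c exits non-zero when the count is 0, so the status is normalized.
count_deploy_fixes() {
  grep -E -i -c -- "$DEPLOY_FIX_PATTERN" || true
}

# Possible usage inside a repository, split at the gate introduction date:
#   git log --until="<gate-date>" --format=%s | count_deploy_fixes
#   git log --since="<gate-date>" --format=%s | count_deploy_fixes
```

Dividing each count by the number of deployments in the same period would give the per-deployment figure listed above. A keyword heuristic will both miss and over-count some commits, so the numbers are best read as a trend rather than an exact tally.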

Findings

Pending.

Next Steps

After confirming that the gate catches the known failure classes, integrate it into the pre-commit workflow so that git push auto-triggers deploy + verify. Consider adding it to the skills/debug-capture-system skill as a deployment layer.