Experiment OpenClaw

Automatic session recovery after WhatsApp disconnection will reduce manual intervention to near-zero

Auto-recovery handled 94% of disconnections without human intervention. Remaining 6% were auth token expirations requiring manual QR re-scan.

Tags: experiment, whatsapp, reliability, auto-recovery

Result: confirmed

Changelog

Date        Summary
2026-04-12  Initial creation

Hypothesis

WhatsApp Web sessions are inherently fragile. Network interruptions, phone sleep mode, WhatsApp server restarts, and battery optimization on the paired phone all cause silent disconnections. The default behavior is total failure: the automation stops, messages queue up unsent, and nobody knows until a human checks. The hypothesis was that an automatic recovery system (crash detection combined with exponential-backoff reconnection) could reduce the need for manual intervention to near-zero, making the system viable for unattended operation over days rather than hours.

The core assumption was that most disconnections are transient (network blips, server-side rebalancing) rather than terminal (auth revocation, phone number change). If that held, reconnecting with the existing session state should succeed most of the time.

Method

The recovery system was built as a three-layer watchdog. Layer 1 monitors the WebSocket connection to WhatsApp Web, detecting disconnection within 5 seconds via heartbeat timeout. Layer 2 attempts reconnection with exponential backoff: 2s, 4s, 8s, 16s, capping at 60s between attempts, with a maximum of 20 attempts before declaring the session dead. Layer 3 logs every disconnection event with timestamps, recovery duration, and failure classification for post-hoc analysis.
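The Layer 2 schedule can be sketched as follows. This is a minimal illustration, not the experiment's actual code: the names `backoff_delays`, `reconnect_with_backoff`, and `try_connect` are hypothetical, and the sketch assumes the wait happens before each attempt, including the first.

```python
import time
from typing import Callable, Iterator


def backoff_delays(
    base: float = 2.0, cap: float = 60.0, max_attempts: int = 20
) -> Iterator[float]:
    """Yield the wait before each attempt: 2, 4, 8, 16, 32, then 60s (capped)."""
    for attempt in range(max_attempts):
        yield min(base * (2 ** attempt), cap)


def reconnect_with_backoff(
    try_connect: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once try_connect() succeeds, False when attempts run out.

    try_connect is a hypothetical callable that resumes the existing session
    and reports success; sleep is injectable so the loop can be tested.
    """
    for delay in backoff_delays():
        sleep(delay)
        if try_connect():
            return True
    return False  # session declared dead after 20 attempts
```

Under these assumptions the waits sum to 2 + 4 + 8 + 16 + 32 + 15 × 60 = 962 seconds, so a session would be declared dead roughly 16 minutes after the first failed heartbeat, plus the time spent inside the connection attempts themselves.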

Testing ran for 14 days in November 2025 on a single WhatsApp Business account handling approximately 200 messages per day. Disconnection events were categorized as: network transient (reconnect succeeds within 30s), extended outage (reconnect succeeds after 30s-5min), auth expiration (reconnect fails, requires QR re-scan), and unknown (reconnect fails for uncategorized reasons).
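The four categories map naturally onto a small classification function. The sketch below is illustrative only: `classify_event` and the `auth_rejected` flag are hypothetical names, and it assumes that any successful recovery slower than 30 seconds counts as an extended outage, even beyond the 5-minute bound given above.

```python
from typing import Optional

TRANSIENT_LIMIT_S = 30.0  # boundary taken from the category definitions


def classify_event(
    recovery_seconds: Optional[float], auth_rejected: bool = False
) -> str:
    """Map one disconnection event to a category name.

    recovery_seconds is None when reconnection never succeeded.
    auth_rejected is a hypothetical flag meaning the stored session was refused.
    """
    if recovery_seconds is None:
        return "auth_expiration" if auth_rejected else "unknown"
    if recovery_seconds <= TRANSIENT_LIMIT_S:
        return "network_transient"
    # Assumption: every slower successful recovery is an extended outage.
    return "extended_outage"
```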

No synthetic disconnections were injected. All events were organic, occurring during normal operation.

Results

Over the 14-day test period, the system experienced 47 disconnection events. Of those, 44 (94%) were recovered automatically without any human intervention. The breakdown: 31 network transients (average recovery time 8s), 13 extended outages (average recovery time 2m 14s), and 3 auth expirations that required manual QR re-scan.

The 3 auth expirations all occurred when the paired phone had been offline for over 14 hours (overnight charging with airplane mode, or phone restart). WhatsApp revokes the web session token after prolonged phone disconnection, and there is no programmatic way to re-authenticate without the QR flow.

The hypothesis is confirmed: auto-recovery reduced manual intervention from every disconnection (previously ~3-5 per day requiring human attention) to approximately once every 4-5 days for auth re-scan only.
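The headline figures follow directly from the event counts. The short script below reproduces the arithmetic; the only derived value not stated in the text is the mean recovery time across all 44 recovered events, which is a weighted average of the two per-category means.

```python
events = {"network_transient": 31, "extended_outage": 13, "auth_expiration": 3}
mean_recovery_s = {"network_transient": 8, "extended_outage": 2 * 60 + 14}
test_days = 14

total = sum(events.values())
recovered = events["network_transient"] + events["extended_outage"]

recovery_rate = recovered / total
transient_share = events["network_transient"] / total
weighted_mean_s = (
    sum(events[k] * mean_recovery_s[k] for k in mean_recovery_s) / recovered
)
days_between_rescans = test_days / events["auth_expiration"]
events_per_day = total / test_days

print(f"recovered automatically: {recovery_rate:.0%}")         # 94%
print(f"recovered in under 30s:  {transient_share:.0%}")       # 66%
print(f"mean recovery time:      {weighted_mean_s:.0f} s")     # 45 s
print(f"days between re-scans:   {days_between_rescans:.1f}")  # 4.7
print(f"disconnections per day:  {events_per_day:.1f}")        # 3.4
```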

Findings

  1. Most disconnections are transient. 66% of all events recovered in under 30 seconds. The WebSocket layer is noisier than it appears from the user-facing UI, which silently reconnects in the background.

  2. Exponential backoff avoids rate limiting. Early testing with aggressive fixed-interval reconnection (every 2s) triggered rate limiting from WhatsApp’s servers. The backoff pattern avoided this entirely.

  3. Auth expiration is the hard boundary. No amount of reconnection logic can recover from a revoked session token. The only mitigation is keeping the paired phone online and connected, which is a user-behavior problem, not a software problem.

  4. Silent failure is worse than crash failure. The original codebase had no disconnection detection at all. Messages would silently fail to send with no error propagation. Adding explicit crash detection was more valuable than the recovery logic itself.
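Finding 4 can be illustrated with a heartbeat check that turns a silent send failure into an explicit error. This is a sketch under stated assumptions, not the original codebase: `HeartbeatMonitor`, `DisconnectedError`, `send_message`, and `transport_send` are hypothetical names, and the 5-second timeout is the Layer 1 value from the Method section.

```python
import time
from typing import Callable


class DisconnectedError(RuntimeError):
    """Raised instead of letting a message fail silently."""


class HeartbeatMonitor:
    def __init__(
        self, timeout_s: float = 5.0, clock: Callable[[], float] = time.monotonic
    ):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat arrived from the connection."""
        self._last_beat = self._clock()

    def is_connected(self) -> bool:
        """True while the last heartbeat is within the timeout."""
        return (self._clock() - self._last_beat) <= self._timeout_s


def send_message(
    monitor: HeartbeatMonitor, transport_send: Callable[[str], None], text: str
) -> None:
    """Send text, or raise so the caller can queue, retry, or alert."""
    if not monitor.is_connected():
        raise DisconnectedError("no heartbeat within timeout; message not sent")
    transport_send(text)
```

Raising at the call site means the failure propagates to whatever scheduled the message, which is the error propagation the original codebase lacked.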

Next Steps

The auth expiration problem motivated the heartbeat monitoring experiment. If the system can detect that the phone has gone offline before the 14-hour token expiration window, it can alert the user proactively rather than waiting for the session to die. See experiments/openclaw/2025-11-28-heartbeat-monitoring-liveness.
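The proactive alert reduces to a threshold check on how long the phone has been offline. The sketch below is hypothetical and is not taken from the heartbeat monitoring experiment: `should_alert` and the 2-hour `ALERT_MARGIN` are illustrative assumptions, and the 14-hour window is the offline duration observed in this experiment rather than a documented WhatsApp limit.

```python
from datetime import datetime, timedelta

EXPIRY_WINDOW = timedelta(hours=14)  # offline duration observed before token loss
ALERT_MARGIN = timedelta(hours=2)    # hypothetical safety margin


def should_alert(phone_last_seen: datetime, now: datetime) -> bool:
    """True once the phone has been offline long enough to risk token expiry."""
    return now - phone_last_seen >= EXPIRY_WINDOW - ALERT_MARGIN
```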