Definition

Cascade Attack

A multi-turn manipulation strategy that escalates context incrementally until a safety boundary is crossed.

definitionai-safetyred-teamadversarial

A multi-turn manipulation strategy that escalates context incrementally until a safety boundary is crossed.

A cascade attack builds context across multiple conversation turns rather than submitting a single blocked request. Turn 1 establishes a legitimate anchor: academic research, fiction, security analysis. Each subsequent turn extends that anchor by a small increment. By turn 4 or 5, the target request arrives as a natural continuation of what the model has already accepted. The mechanism exploits the model’s consistency bias: prior responses function as implicit permission for the next step. Content filters evaluating each turn in isolation cannot detect the drift. A cascade attack is the AI safety analog of the boiling frog: no individual step is alarming; the cumulative trajectory is.

How It Works

Anchor with plausible framing (turn 1) → escalate by controlled increments (turns 2-N) → submit target request as continuation (turn N+1). Works at depths 3, 5, 8, and 12. Succeeds against providers whose filters evaluate per-turn without cross-turn context tracking.

Example

The RedCorsair experiment tested cascade attacks across 3 providers at 4 depths. Single-turn refusal rate: 100%. Cascade attack success rate at 4+ turns: 100%. The asymmetry is categorical, not marginal. No specific provider vulnerability: any filter architecture lacking cross-turn state is structurally blind to this attack class.