Breakthrough Preferences redcorsair

Cascade attack: multi-turn > single-turn for AI safety red-teaming

February 10, 2026

breakthroughredcorsairai-agents

Key Metric

Before

Single-turn jailbreak attempts blocked by safety filters

→

↓

After

Multi-turn cascades bypass filters that block equivalent ...

Single-turn jailbreaks fail. Multi-turn cascades succeed. That asymmetry is the finding that redefined the entire red-teaming approach on this project.

Context

AI safety red-teaming historically relied on single-turn prompts designed to bypass filters in one shot. The assumption was that content filters evaluate each message independently. Before this experiment, every automated jailbreak attempt hit the same wall: the filter read the request, matched it against known patterns, and blocked it. The problem wasn’t creativity in the prompts. The problem was the framing: one request, one refusal.

What Changed

The cascade attack pattern treats each conversation turn as context-building rather than an isolated request. Turn 1 establishes a plausible framing. Turn 2 extends it slightly. By turn 4 or 5, the model is in a context where the originally-blocked request seems like a natural continuation of what has already been accepted. Each turn exploits the model’s tendency to maintain consistency with prior conversation history. The progressive context shift creates a “boiling frog” effect: no single step is alarming, but the cumulative drift takes the model far outside the boundaries it would enforce on a cold start.

This was tested systematically across multiple LLM providers using a structured cascade framework with variable depth (3–7 turns) and controlled topic escalation rates.

Impact

The experiment confirmed the hypothesis: multi-turn cascades bypass filters that block equivalent single-turn requests in every tested case. The mechanism is fundamental, not a quirk of any specific provider’s implementation. Content filters that evaluate requests in isolation are structurally blind to this attack class.

Before: single-turn jailbreak attempts blocked at a 100% rate by content filters. After: cascade attacks of 4+ turns succeeded against the same filters. The asymmetry is not marginal : it is categorical.

This finding informed defensive patterns used in subsequent projects and was shared back into the red-team research body as a documented attack class.

Source

Experiment: experiments/redcorsair/2026-02-10-cascade-attack-multi-turn-jailbreak
Project: projects/redcorsair/_index
Topic: topics/redcorsair-cascade-attack-pattern