The Boiling Frog Attack: Why AI Safety Fails at Multi-Turn
Single-turn refusal rate: 100%. Cascade attack success rate at 4+ turns: 100%. The asymmetry is categorical, not marginal, and it held across every provider and every depth tested. The week of Feb 9-15 split into five active days and two git-only days: 114 sessions, $164.88 in compute, 74% of that driven by a single eval project. The breakout finding was architectural: content filters that evaluate turns in isolation cannot detect the attack class that succeeded in every case.
Single-turn jailbreaks fail for a simple reason: the filter reads the request, matches it against known patterns, blocks it. Red-team research built on this model for years. The assumption was structural: content filters are the right unit, and hardening them is the right strategy.
The cascade attack experiment broke that assumption.
The method ran in three phases. Phase 1 anchored a legitimate framing: academic research, fiction, security analysis. Any model accepts these framings on cold start. Phase 2 escalated each subsequent turn by a controlled increment, using the model’s prior responses as implicit permission for the next step. Phase 3 submitted the target request as a natural continuation of the established context. Tested at depths of 3, 5, 8, and 12 turns across 3 providers.
Result: the attack was confirmed at every depth, against every provider tested. Single-turn refusal rate held at 100%. Cascade attacks of 4+ turns succeeded in every case.
The mechanism is what makes this a category finding rather than a technique. Each turn is evaluated in isolation. No single turn looks alarming. But the model’s consistency bias does the work: prior responses function as implicit permission for each subsequent step. By turn 4, the originally blocked request arrives as a logical continuation of an established context the model has already validated.
This is not a quirk of any specific provider’s implementation. It applies wherever filters operate turn-by-turn without cross-turn context tracking. The architectural blind spot is not in the filter’s pattern matching; it’s in the decision to treat turns as independent events.
The practical implication is harder to fix than the technical one. Stateless turn-by-turn evaluation is the default architecture because it’s cheap: no memory, no history, no cross-turn state. Cross-turn context tracking requires storing conversation history, running a second pass that evaluates drift rather than content, and accepting higher latency on every turn. The economics push against the fix.
The name that stuck internally was the “boiling frog” attack: a frog placed in boiling water jumps out; a frog in water heated gradually does not notice until it’s too late. The temperature increments that seem safe individually produce a lethal outcome in sequence.
The finding was shared back into the red-team research body as a documented attack class. It seeded defensive pattern work in subsequent projects. The practical defensive direction: cross-turn context drift scoring, not per-turn content matching. Score each conversation’s cumulative trajectory; trigger on the derivative, not the absolute position. The engineering cost is real. The alternative is a category of attack that succeeds in every case.
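The "trigger on the derivative" idea can be sketched in a few lines. This is a minimal illustration, not the defensive work the projects actually built: it assumes some external scorer already assigns each turn a risk score in [0, 1], and the `DriftMonitor` class, its thresholds, and the sample scores are all made up for the example.

```javascript
// Hypothetical sketch: per-turn risk scores are assumed to come from an
// external scorer returning a number in [0, 1]. Names, thresholds, and the
// sample scores are illustrative, not taken from the experiment.
class DriftMonitor {
  constructor({ window = 3, slopeLimit = 0.1, absoluteLimit = 0.8 } = {}) {
    this.window = window;               // recent turns used to measure drift
    this.slopeLimit = slopeLimit;       // max tolerated average rise per turn
    this.absoluteLimit = absoluteLimit; // classic per-turn content threshold
    this.scores = [];                   // cross-turn state a stateless filter lacks
  }

  // Record one turn's score and report which checks fire.
  observe(score) {
    this.scores.push(score);
    const recent = this.scores.slice(-this.window);
    // Average per-turn change across the window (a discrete derivative).
    const slope = recent.length < 2
      ? 0
      : (recent[recent.length - 1] - recent[0]) / (recent.length - 1);
    return {
      slope,
      perTurnFlag: score >= this.absoluteLimit,
      driftFlag: recent.length === this.window && slope >= this.slopeLimit,
    };
  }
}

// A gradual escalation: no single turn crosses the absolute limit.
const monitor = new DriftMonitor();
const verdicts = [0.10, 0.25, 0.40, 0.55, 0.70].map((s) => monitor.observe(s));
```

With these sample scores the per-turn check never fires, while the drift check fires from the third turn onward, once a full window of history exists. The cost discussed above shows up directly in the code: the monitor has to keep `scores` per conversation, which is exactly the state a stateless filter avoids.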
Transferable insight: Any safety check that evaluates requests in isolation is blind to multi-turn manipulation. Effective defense requires stateful context tracking, not better per-turn pattern matching.
Seeded PRNG: 73 Encoding Variants, Zero Snapshot Drift
Mixin: determinism as a prerequisite for testability
The campaign generation pipeline contained 73 encoding variants, each applying a different mathematical transform to content: bit rotation, XOR, modular arithmetic. Each variant called Math.random() internally. Any test asserting on campaign output was testing a non-deterministic system: two runs with identical inputs could produce different outputs, and neither would be wrong.
The seeded PRNG experiment replaced all Math.random() calls with a seeded Mulberry32 PRNG. The 73 variants were the highest-risk surface: an audit found 47 internal Math.random() calls that needed replacement. Three of those 47 could not be trivially swapped because they depended on platform timing for entropy seeding. Those 3 were refactored to receive the seed from the campaign-level PRNG rather than generating their own.
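For reference, a common public-domain formulation of Mulberry32 is shown below, together with a sketch of the "receive the seed from the campaign-level PRNG" refactor. The generator follows the widely circulated version of the algorithm; `xorVariant` and `generateCampaignSketch` are hypothetical stand-ins, not the project's actual variants or its `generateCampaign` function.

```javascript
// Mulberry32: a small 32-bit seeded PRNG. Returns a function that yields
// floats in [0, 1), so it can stand in for Math.random().
function mulberry32(seed) {
  let a = seed >>> 0; // keep the state as an unsigned 32-bit integer
  return function next() {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical variant: takes the campaign-level generator as an argument
// instead of calling Math.random() or seeding itself from platform timing.
function xorVariant(bytes, rng) {
  const key = Math.floor(rng() * 256); // one key byte drawn from the shared stream
  return bytes.map((b) => b ^ key);
}

// Hypothetical campaign entry point: a single seed drives every variant.
function generateCampaignSketch(seed, payload) {
  const rng = mulberry32(seed);
  return { seed, xor: xorVariant(payload, rng) };
}

const first = generateCampaignSketch(42, [1, 2, 3]);
const second = generateCampaignSketch(42, [1, 2, 3]);
// first and second are deep-equal: same seed, same output.
```

Passing `rng` in as a parameter is the design choice that matters here: a variant that owns its own entropy source cannot be replayed, while one that draws from an injected stream is reproducible by construction.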
The result: all 73 encoding variants produce deterministic output given the same seed. Snapshot testing became fully operational. The Jest suite calls generateCampaign(seed) and diffs the output against committed golden files. Any regression forces an explicit decision rather than allowing silent drift.
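What a golden-file comparison does under the hood can be sketched without any test framework. The helpers below are hypothetical and not the project's tooling: `stableStringify` serializes output with sorted keys, and `diffAgainstGolden` reports which lines of the committed text no longer match. The sample campaign objects are invented for illustration.

```javascript
// Serialize with sorted object keys so the text depends only on the data,
// not on property insertion order.
function stableStringify(value) {
  return JSON.stringify(value, (key, val) => {
    if (val && typeof val === 'object' && !Array.isArray(val)) {
      return Object.keys(val).sort().reduce((acc, k) => {
        acc[k] = val[k];
        return acc;
      }, {});
    }
    return val;
  }, 2);
}

// Compare fresh output against the committed golden text and list the
// 1-based line numbers that differ. An empty list means no drift.
function diffAgainstGolden(goldenText, actual) {
  const want = goldenText.split('\n');
  const got = stableStringify(actual).split('\n');
  const changed = [];
  for (let i = 0; i < Math.max(want.length, got.length); i++) {
    if (want[i] !== got[i]) changed.push(i + 1);
  }
  return changed;
}

// Illustrative data only: a golden file and two fresh runs.
const golden = stableStringify({ seed: 42, variants: [17, 203] });
const same = diffAgainstGolden(golden, { variants: [17, 203], seed: 42 });
const drifted = diffAgainstGolden(golden, { seed: 42, variants: [17, 204] });
```

Here `same` is empty even though the keys arrive in a different order, and `drifted` points at the single changed line. In a Jest suite the store-and-compare step would normally be handled by `expect(output).toMatchSnapshot()`; the comparison only becomes meaningful once the output is deterministic, which is what the seeding work provides.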
The finding generalizes further than the immediate project. Probabilistic systems produce probabilistic tests: coverage gaps that look like noise, regressions that look like variance. Deterministic systems allow snapshot coverage of arbitrarily complex output structures without writing explicit assertions for every field. The discipline cost is moderate; the test coverage payoff is compounding.
Transferable insight: Systems that embrace determinism as a design principle become testable in ways probabilistic systems never can. The seeding decision unlocks an entire testing strategy, not just one test.
$110 on Thursday: How Eval Projects Concentrate Cost
Mixin: session burst patterns in AI project portfolios
Thursday, Feb 12: 45 sessions, $110.16, 67% of the week’s $164.88 in a single day. The eval project ran all 5 days and drove 74% of total weekly spend on its own: 65 sessions, $122.38, an average duration of 2.1 minutes per session.
The token signature tells the story more precisely than the cost alone. Eval logged 10,242 input tokens and 6,019 output tokens against 39.1M cache-read tokens. The cache-read volume exceeds live input by a factor of roughly 3,800. This is not a project generating long outputs; it’s a project querying large, stable cached contexts repeatedly. The pattern is consistent with an evaluation harness running batched jobs against a fixed prompt structure: the system prompt is warm, the test inputs are small, and each session adds a thin layer of steering on top of a pre-warmed foundation.
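The ratio and the spend shares quoted above can be reproduced directly from the reported figures. This is plain arithmetic on numbers from the text; the only caveat is that "39.1M" is already rounded in the source, so the ratio is approximate.

```javascript
// Figures quoted in the text for the eval project and the week.
const cacheReadTokens = 39.1e6; // "39.1M" is rounded in the source
const inputTokens = 10242;

// Diagnostic signal: cached context read per live input token.
const cacheReadToInput = cacheReadTokens / inputTokens;

// Share of weekly spend, as whole percentages.
const weekCost = 164.88;
const thursdayShare = Math.round((110.16 / weekCost) * 100);
const evalShare = Math.round((122.38 / weekCost) * 100);

console.log(Math.round(cacheReadToInput), thursdayShare, evalShare);
// prints: 3818 67 74
```

The computed ratio of about 3,818 matches the "roughly 3,800" stated above, and the two shares match the 67% and 74% figures.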
The 0.9941 average cache hit rate across the full week confirms the infrastructure is working as designed. Individual days ran at 0.9999 on Monday, Wednesday, and Thursday. Friday dropped to 0.984 on 22 sessions, the one outlier in an otherwise clean efficiency picture.
What the numbers cannot show: Feb 14-15 produced zero session telemetry despite carrying the week’s highest single-day commit volumes. The bloomnet.db gap means cost, model mix, and session duration data for those two days are unavailable. The session record shows a 5-day week; the git record shows a 7-day one.
MQI ran below the 0.31 warning threshold on all five tracked days. Composite Z moved from -0.924 on Monday to -0.5149 by Thursday, then regressed to -0.6279 on Friday. The midweek plateau and Friday reversal suggest the quality dimensions that lifted toward warning status on Wednesday and Thursday did not sustain. The week closed below where it peaked, which the next-week action list logs as an open question: which composite Z dimensions are responsible for the ceiling at -0.51.
Transferable insight: In AI project portfolios, cost concentration in evaluation workloads is expected, not anomalous. The diagnostic signal is the cache-read-to-input ratio, not the session count or raw cost.
Zeitgeist
By the Numbers
| Metric | Value |
|---|---|
| Sessions | 114 |
| Total cost | $164.88 |
| Largest single day | $110.16 (Thu Feb 12, 45 sessions) |
| Eval share of spend | 74% |
| Avg cache hit rate | 0.9941 |
| Avg MQI | 0.2534 |
| MQI delta vs prior week | -0.0086 |
| Projects active | 6 |
| Git-only days | 2 (Sat-Sun) |
| Cascade attack providers tested | 3 |
| Cascade attack depths tested | 4 (3, 5, 8, 12 turns) |
| PRNG encoding variants | 73 |
Changelog
260507: Generated by journalize-weekly (topic-first format, v2 regeneration)
Rewrote from per-project format to topic-first. Primary: cascade attack structural blind spot. Mixins: seeded PRNG + cost concentration. Stripped private project refs from frontmatter and body. Added definitions: cascade-attack, seeded-prng, snapshot-testing.