AI Model Escalation Is Demand-Driven, Not Supply-Driven
479 sessions, $221.28, four active days plus a wind-down day. The model did not stay constant: 194 Sonnet sessions at 3.07 minutes each gave way to 48 Opus sessions at 4.27 minutes each. The configuration did not change; the easy targets ran out.
The hypothesis going in was that a research pipeline would naturally escalate model weight as easy targets were exhausted. Four active days confirmed it precisely.
January 28 was the inaugural run: 79 sessions, $77.47, 569 minutes of elapsed time. The cost-to-duration ratio identifies this as cheap-model territory: most session time went to fetches, not reasoning. MQI landed at 0.3213 (watch status). The pipeline was calibrating before volume hit.
January 29 was the throughput peak: 194 sessions at 3.07 minutes average, Claude Sonnet dominant across every top log. Five measured token windows ran 79K, 84K, 114K, 37K, and 75K. The 114K window sat within 14K of the Sonnet ceiling and almost certainly triggered summarization. $38.60 bought breadth, not depth. That was the right trade for a sweep pass.
January 30 was the shift. Opus replaced Sonnet. Session count fell to 104; elapsed minutes dropped to 251; cost rose to $45.49. The token windows compressed by an order of magnitude: 5K to 10K on Opus versus 37K to 114K on Sonnet the day before. Fewer calls, smaller contexts, same cost band, qualitatively different processing.
January 31 continued the convergence: 48 sessions, 205 minutes, $48.02. Average session length grew from 2.4 to 4.27 minutes, a 77% increase over the prior day. The pipeline had reached the targets that Sonnet queued for deep reasoning. MQI hit its floor at 0.1737, the week’s only error day. Per-session cost was the highest of the week. The pipeline was doing harder work than on any prior day.
The critical finding is where the escalation signal came from. No routing rule changed. No configuration was updated. The target pool composition changed: the January 28 batch contained a full mix; by January 31 only the hard targets remained. Session architecture determined the cost trajectory more than model selection ever could have. A budget that planned around the cheapest model in the rotation would have been wrong by a factor of three on the final days.
February 1 closed at a different pace: 54 sessions, 99 minutes, $11.70. Claude Haiku dominated as the active pipeline wound down and lighter verification tasks took over. MQI recovered to 0.2966 as volume eased. The week-average MQI of 0.2639 against the prior week’s 0.3172 marks a -0.0533 delta: a meaningful degradation at this session volume.
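The per-day arithmetic behind these figures is easy to recompute. The sketch below takes the session counts, costs, and elapsed minutes quoted above and derives average session length and cost per session. One assumption is labeled in the code: elapsed minutes for January 29 are not reported, so they are back-computed from the 3.07-minute average. The names `DAYS`, `derive`, and `pct_change` are illustrative and not part of the pipeline.

```python
# Per-day figures as quoted in the text: (sessions, cost in USD, elapsed minutes).
# Jan 29 elapsed minutes are not stated; they are back-computed here from the
# quoted 3.07-minute average, which is an assumption of this sketch.
DAYS = {
    "Jan 28": (79, 77.47, 569.0),
    "Jan 29": (194, 38.60, 194 * 3.07),
    "Jan 30": (104, 45.49, 251.0),
    "Jan 31": (48, 48.02, 205.0),
    "Feb 1": (54, 11.70, 99.0),
}


def derive(sessions, cost, minutes):
    """Return (average minutes per session, cost per session)."""
    return minutes / sessions, cost / sessions


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


derived = {day: derive(*row) for day, row in DAYS.items()}
total_sessions = sum(row[0] for row in DAYS.values())
total_cost = round(sum(row[1] for row in DAYS.values()), 2)

for day, (avg_min, per_session) in derived.items():
    print(f"{day}: {avg_min:.2f} min/session, ${per_session:.2f}/session")
print(f"total: {total_sessions} sessions, ${total_cost:.2f}")

jump = pct_change(derived["Jan 30"][0], derived["Jan 31"][0])
print(f"Jan 30 -> Jan 31 session length: {jump:+.0f}%")
```

Run as written, this reproduces the 479-session and $221.28 totals, the 4.27-minute average for January 31, and the 77% day-over-day increase in session length.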
Transferable insight: Model escalation in research pipelines is driven by target difficulty, not configuration. When easy targets are exhausted, the remaining queue forces deeper models regardless of routing defaults. Budget by difficulty tier, not by model selection.
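A minimal sketch of what budgeting by difficulty tier could look like, under stated assumptions. The tier names `easy` and `hard` are hypothetical, the per-session prices are simply the observed daily averages from the Sonnet-dominant day ($38.60 over 194 sessions) and the Opus-dominant day ($48.02 over 48 sessions), and one session per target is assumed. None of this is the pipeline's actual routing or accounting logic.

```python
# Hypothetical per-session prices by difficulty tier, taken from the observed
# daily averages. Assumes one session per target.
COST_PER_SESSION = {
    "easy": 38.60 / 194,  # Sonnet-dominant sweep day (Jan 29)
    "hard": 48.02 / 48,   # Opus-dominant deep day (Jan 31)
}


def budget_by_tier(queue):
    """Estimate spend for a queue given as {tier: target_count}."""
    return sum(COST_PER_SESSION[tier] * count for tier, count in queue.items())


# Same tier prices, different queue composition: the estimate follows the
# difficulty mix rather than a single default model price.
early_queue = {"easy": 150, "hard": 10}  # illustrative mixed pool
late_queue = {"easy": 0, "hard": 48}     # only hard targets remain

print(f"early queue: ${budget_by_tier(early_queue):.2f}")
print(f"late queue:  ${budget_by_tier(late_queue):.2f}")
```

The design choice is that the budget input is the composition of the remaining queue. As that composition shifts toward hard targets, the estimate rises even though no price or routing default has been edited.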
Your First Production Pipeline Run Is a Calibration Event
Mixin: first runs measure the system, they do not deliver output
The January 28 inaugural run produced 79 sessions, $77.47, 569 minutes of elapsed time. Those numbers are not output metrics. They are calibration data.
The pipeline was learning its own operating parameters: what the cost profile looks like against a real target pool at a real scale, what MQI reads under production load, where the first failures appear. MQI at 0.3213 on inauguration day is the baseline, not a benchmark to beat.
The commit appearing twice on that day tells the full story. Gateway timestamp injection shipped; a monotonicity flaw was found in the same session; the fix re-shipped corrected within hours. A timestamp injection that does not guarantee ordering is not a timestamp injection: it is an event log with a race condition. Finding and fixing it on day one is not embarrassing. It is exactly what a calibration event should produce.
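To make the monotonicity requirement concrete, here is a minimal sketch of a timestamp injector that guarantees strictly increasing values. It is not the gateway's actual fix, which the source does not describe. The class `MonotonicStamper`, the `inject` helper, and the `ts` field name are all hypothetical.

```python
import threading
import time


class MonotonicStamper:
    """Hands out strictly increasing nanosecond timestamps.

    Hypothetical helper: if the clock repeats a value or steps backwards,
    the stamper returns the previous value plus 1 ns instead.
    """

    def __init__(self, clock=time.time_ns):
        self._clock = clock
        self._last = 0
        self._lock = threading.Lock()  # serialize concurrent callers

    def next(self):
        with self._lock:
            now = self._clock()
            # Never go backwards and never repeat a value.
            self._last = max(now, self._last + 1)
            return self._last


def inject(event, stamper):
    """Return a copy of the event dict with a 'ts' field added."""
    return {**event, "ts": stamper.next()}


# Usage with a fake clock that repeats a value and then steps backwards.
ticks = iter([100, 100, 90, 200])
stamper = MonotonicStamper(clock=lambda: next(ticks))
stamps = [inject({"id": i}, stamper)["ts"] for i in range(4)]
print(stamps)  # [100, 101, 102, 200]
```

The lock matters as much as the `max`: without it, two concurrent callers could read the same `_last` and emit the same timestamp, which is the race the paragraph above describes.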
The distinction between a delivery and a calibration event matters for how you read the numbers that follow. Session 2 through session 79 were not producing finished research output. They were producing data about how the pipeline behaves. The real output was the corrected gateway implementation and the MQI baseline against which every future week can be compared.
Pipelines that skip the explicit calibration frame tend to ship first-run output as if it were production-grade, then spend weeks reverse-engineering why the data quality is inconsistent. The calibration frame keeps the first run honest: its job is to measure, not to deliver.
Transferable insight: Treat the first production run of any new pipeline as a calibration event, not a delivery. The data it produces is about the system itself. Establishing a baseline takes priority over extracting output from that first run.
634 Commits From a Contributor Who Became a Maintainer
Mixin: contribution patterns have visible phase transitions
While investor-research ran 479 sessions, openclaw processed 634 commits across the same seven days: 159 on Monday, 109 on Tuesday, 58 on Wednesday, 39 on Thursday, 62 on Friday, 119 on Saturday, 88 on Sunday.
The work ranged across contributor PR integration, OAuth email normalization, Telegram token and thread-id normalization, Windows spawn path hardening for ENOENT errors on yarn, and iOS TTS path shipping. The surface area is unusual for a single week: cross-platform runtime issues, identity normalization across three protocols, and mobile audio all in seven days.
The contributor-then-maintainer pattern appeared on multiple consecutive days. An outside contributor catches an entry-point bug; the maintainer patches the corresponding exit point or the inverse case. The pattern has a specific shape: the contributor’s PR reveals the boundary correctly; the maintainer’s follow-on commit moves the fix to the right layer. When this recurs on consecutive days, it means the codebase has found external contributors who understand the system well enough to probe real boundaries rather than surface-level issues.
634 commits while a parallel pipeline ran 479 sessions without interference is the practical argument for asynchronous workstreams. The pipeline did not need supervision. The commit work did not need the pipeline to pause. The two activities ran on different cognitive tracks and neither blocked the other. Research pipelines that can run without attention are the enabling condition for this kind of parallel throughput.
Transferable insight: Contribution patterns have visible phase transitions. When external contributors start finding the correct boundary to probe rather than the surface symptom, the architecture has communicated itself through the code. That is a more reliable signal than documentation coverage.
Zeitgeist
By the Numbers
| Metric | Value |
|---|---|
| Sessions total | 479 |
| Total cost | $221.28 |
| Avg cache-hit rate | 95.89% |
| Peak day (Jan 29) | 194 sessions, $38.60, Sonnet dominant |
| Deepest day (Jan 31) | 48 sessions, $48.02, Opus dominant |
| Lightest day (Feb 1) | 54 sessions, $11.70, Haiku dominant |
| MQI high (Jan 28) | 0.3213 (watch) |
| MQI low (Jan 31) | 0.1737 (error) |
| MQI avg this week | 0.2639 |
| MQI delta vs prior week | -0.0533 |
| Openclaw commits | 634 |
| Projects active | 2 (investor-research, openclaw) |
| Session depth increase Jan 29 to Jan 31 | +39% (3.07 to 4.27 min avg) |
Changelog
260507: Generated by journalize-weekly (topic-first format, v2 regeneration)
SPARSE week: session telemetry exists for Jan 28 through Feb 1 only; Jan 26-27 had no recorded sessions. Article synthesized from existing weekly (source: journalize-weekly backfill 260506), packet data, and editorial topic picks. Experiment and breakthrough links generalized; private project references not present in this article.