Experiment investor-research

ai-agents, investor-research, model-optimization, sonnet

Hypothesis

A research pipeline will naturally demand model escalation as easy targets are exhausted, with the transition visible in session-length growth and dominant-model shift from Sonnet to Opus over a multi-day run.

Result: confirmed

Key Findings

Jan 28: 569 min, $3.09, cheap-model territory (long background fetches, minimal reasoning). Jan 29-30: session count peaked at 104/day as Sonnet swept breadth targets. Jan 31: session count halved to 48, average session length grew 77% (4.27 min vs 2.4 min), Opus 4.5 dominant, cost jumped to $48.02. The pipeline naturally shifted from throughput to reasoning as easy targets were exhausted. The model-swap was not pre-planned but emerged from the target queue: complex targets that Sonnet queued for deep reasoning were processed by Opus on Jan 31.

Changelog

| Date | Summary |
| --- | --- |
| 2026-04-28 | Audit pass: frontmatter + structure normalization |
| 2026-01-28 | Initial creation |

Hypothesis

Research pipelines face a natural tension between breadth (covering many targets cheaply) and depth (reasoning deeply about complex targets). The hypothesis was that a multi-day research pipeline would naturally demand model escalation as easy targets were exhausted, rather than staying productive on a single model tier for the entire run. The escalation would be visible in growing session lengths and a shift in dominant model from Sonnet to Opus.

Method

The investor-research pipeline ran from January 28 through January 31. No explicit model-selection logic was built in; the pipeline used whatever model was configured, and the operator (me) could swap models between runs. The experiment tracked:

| Metric | How Measured |
| --- | --- |
| Session count | bloomnet.db session telemetry |
| Session duration | Total minutes per day |
| Cost | API spend per day |
| Dominant model | Model with majority of tokens per day |
| Session length trend | Average minutes per session |

The key observation was whether the pipeline would naturally demand different model capabilities as it progressed through its target queue, and whether the cost profile would reflect that demand.
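The five metrics listed above can be derived from per-session rows with a short aggregation. The following is a minimal sketch, not the pipeline's actual code: the `Session` fields and the `daily_metrics` function are hypothetical names, and the real bloomnet.db schema is not described here. "Dominant model" is taken to mean the model with the largest token share for the day.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    """One telemetry row. Field names are hypothetical, not the bloomnet.db schema."""
    day: str         # e.g. "2026-01-31"
    minutes: float   # session duration in minutes
    cost_usd: float  # API spend attributed to the session
    model: str       # model that served the session
    tokens: int      # tokens consumed by the session


def daily_metrics(sessions):
    """Aggregate session rows into per-day metrics."""
    by_day = defaultdict(list)
    for s in sessions:
        by_day[s.day].append(s)

    report = {}
    for day, rows in sorted(by_day.items()):
        total_minutes = sum(r.minutes for r in rows)
        tokens_by_model = defaultdict(int)
        for r in rows:
            tokens_by_model[r.model] += r.tokens
        report[day] = {
            "sessions": len(rows),
            "minutes": total_minutes,
            "cost_usd": round(sum(r.cost_usd for r in rows), 2),
            "avg_min_per_session": total_minutes / len(rows),
            # Model with the largest token share that day.
            "dominant_model": max(tokens_by_model, key=tokens_by_model.get),
        }
    return report
```

Tracking average minutes per session alongside the daily totals matters because the totals alone can hide the shift: a day with fewer, longer sessions can show a similar total runtime to a day with many short ones.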

Results

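| Date | Sessions | Minutes | Cost | Avg min/session | Dominant model | Character |
| --- | --- | --- | --- | --- | --- | --- |
| Jan 28 | 2 | 569 | $3.09 | 284.5 | Cheap (Sonnet) | Inaugural run, mostly fetch-wait |
| Jan 29 | ~100+ | ~250 | ~$25 | ~2.4 | Sonnet | Peak throughput sweep |
| Jan 30 | 104 | 250 | ~$30 | 2.4 | Sonnet | Continued breadth coverage |
| Jan 31 | 48 | 205 | $48.02 | 4.27 | Opus 4.5 | Deep reasoning, 77% longer sessions |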

The model-swap was not pre-planned but emerged from the target queue. By Jan 30, Sonnet had exhausted the easy targets, those requiring primarily information retrieval and light summarization. The remaining targets required deeper reasoning: cross-referencing multiple sources, evaluating contradictory signals, and producing structured assessments. These were processed by Opus on Jan 31, after the operator switched models.

Cost rose from $3.09 on Jan 28 to $48.02 on Jan 31 (roughly 15x), even as total runtime fell from 569 minutes of background runs to 205 minutes across 48 sessions averaging 4.27 minutes each. The gap reflects the fundamental cost difference between “pipeline waiting on fetches” and “model doing hard reasoning.”
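The derived figures in this section can be checked directly from the daily totals. The sketch below is only an arithmetic check using numbers copied from the results table; the variable and function names are illustrative, and the Jan 30 cost is the approximate value given there.

```python
# Daily totals copied from the results table. Jan 29 is omitted because
# all of its values are approximate.
days = {
    "Jan 28": {"sessions": 2, "minutes": 569, "cost": 3.09},
    "Jan 30": {"sessions": 104, "minutes": 250, "cost": 30.00},  # cost is approximate
    "Jan 31": {"sessions": 48, "minutes": 205, "cost": 48.02},
}


def avg_minutes(day):
    """Average session length in minutes for one day."""
    d = days[day]
    return d["minutes"] / d["sessions"]


def cost_per_minute(day):
    """API spend per minute of session time for one day."""
    d = days[day]
    return d["cost"] / d["minutes"]


# Growth in average session length from the Sonnet sweep to the Opus day.
growth = avg_minutes("Jan 31") / avg_minutes("Jan 30") - 1
# Cost of the Opus day relative to the inaugural fetch-heavy run.
cost_ratio = days["Jan 31"]["cost"] / days["Jan 28"]["cost"]

for day in days:
    print(f"{day}: {avg_minutes(day):.2f} min/session, ${cost_per_minute(day):.3f}/min")
print(f"Session-length growth, Jan 30 to Jan 31: {growth:.1%}")
print(f"Cost ratio, Jan 31 vs Jan 28: {cost_ratio:.1f}x")
```

From the unrounded totals, session-length growth comes out at about 77.7% and the cost ratio at about 15.5x, consistent with the rounded 77% and 15x quoted in the text.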

Findings

  1. Model-swap emerges naturally from target complexity. No explicit routing logic was needed. The operator recognized that Sonnet was producing shallow results on complex targets and switched to Opus. A production system could automate this with a complexity classifier, but manual switching worked for the initial run.

  2. Cost is not the right optimization target. The $48.02 Jan 31 spend produced research outputs that the $3.09 Jan 28 run could not. Optimizing for cost would have kept the pipeline on Sonnet and produced breadth without depth. The right metric is actionable-output-per-dollar, not dollar-per-session.

  3. Session length is the leading indicator of model demand. When average session length grows (2.4 min to 4.27 min, a 77% increase), the pipeline is hitting targets that require more context per call. This is the signal to consider model escalation.

  4. Long background runs are cheap. The 569-minute inaugural run at $3.09 ($0.005/min) confirms that research pipelines dominated by fetch-and-wait are nearly free in API terms. The cost lives in the reasoning, not the orchestration.
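Finding 3 can be expressed as a simple threshold check on average session length. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the 0.5 (50%) default threshold is a placeholder for illustration, not a value tuned or tested in this experiment.

```python
def session_length_growth(baseline_avg_min, current_avg_min):
    """Fractional growth in average session length relative to a baseline."""
    if baseline_avg_min <= 0:
        raise ValueError("baseline_avg_min must be positive")
    return current_avg_min / baseline_avg_min - 1.0


def should_consider_escalation(baseline_avg_min, current_avg_min, threshold=0.5):
    """Return True when growth meets the threshold.

    The 0.5 default is an illustrative placeholder, not a tuned value.
    """
    return session_length_growth(baseline_avg_min, current_avg_min) >= threshold


# Averages from the results: 2.4 min during the Sonnet sweep, 4.27 min on Jan 31.
print(should_consider_escalation(2.4, 4.27))  # True
```

The baseline should come from comparable reasoning sessions. Using the Jan 28 average of 284.5 minutes, which was dominated by fetch-wait, would mask the signal entirely.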

Next Steps

The model-swap pattern observed here is a candidate for automation: a complexity classifier on the target queue could route easy targets to Sonnet and complex targets to Opus without manual intervention. Session-length growth (the 77% increase) is the trigger signal.
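One possible shape for such a classifier is a rule-based score over the traits that marked the complex targets in this run: multiple sources to cross-reference, contradictory signals, and a need for a structured assessment. Everything in this sketch is hypothetical, including the `Target` fields, the scoring rule, and the `escalate_at` cutoff; the strings "sonnet" and "opus" are tier labels, not API model identifiers.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """A queued research target. All fields are hypothetical illustrations."""
    name: str
    source_count: int                  # distinct sources to cross-reference
    has_conflicting_signals: bool      # sources disagree and need weighing
    needs_structured_assessment: bool  # output is an assessment, not a summary


def complexity_score(target):
    """Count how many of the three complexity traits a target shows (0 to 3)."""
    score = 0
    if target.source_count > 1:
        score += 1
    if target.has_conflicting_signals:
        score += 1
    if target.needs_structured_assessment:
        score += 1
    return score


def route(target, escalate_at=2):
    """Pick a model tier; the escalate_at cutoff is an assumed placeholder."""
    return "opus" if complexity_score(target) >= escalate_at else "sonnet"


queue = [
    Target("simple lookup", 1, False, False),
    Target("contested profile", 4, True, True),
]
for t in queue:
    print(t.name, "->", route(t))
```

A rule-based score keeps routing decisions auditable, which would make it easier to compare automated routing against the manual switch made on Jan 31 before replacing it with a learned classifier.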

Source