Experiment investor-research

January 27, 2026

ai-agents, investor-research, model-optimization, sonnet

Hypothesis

A research pipeline will naturally demand model escalation as easy targets are exhausted, with the transition visible in session-length growth and dominant-model shift from Sonnet to Opus over a multi-day run.

Result: confirmed

Key Findings

Jan 28: 569 min, $3.09, cheap-model territory (long background fetches, minimal reasoning). Jan 29-30: session count peaked at 104/day as Sonnet swept breadth targets. Jan 31: session count halved to 48, average session length grew 77% (4.27 min vs 2.4 min), Opus 4.5 dominant, cost jumped to $48.02. The pipeline naturally shifted from throughput to reasoning as easy targets were exhausted. The model-swap was not pre-planned but emerged from the target queue: complex targets that Sonnet queued for deep reasoning were processed by Opus on Jan 31.

Changelog

Date	Summary
2026-04-28	Audit pass: frontmatter + structure normalization
2026-01-28	Initial creation

Hypothesis

Research pipelines face a natural tension between breadth (covering many targets cheaply) and depth (reasoning deeply about complex targets). The hypothesis was that a multi-day research pipeline would demand model escalation as easy targets were exhausted, rather than staying productive on a single model tier for the entire run. The escalation should be visible in two signals: growing session lengths and a shift in dominant model from Sonnet to Opus.

Method

The investor-research pipeline ran from January 28 through January 31. No explicit model-selection logic was built in; the pipeline used whatever model was configured, and the operator (me) could swap models between runs. The experiment tracked:

Metric	How Measured
Session count	bloomnet.db session telemetry
Session duration	Total minutes per day
Cost	API spend per day
Dominant model	Model with majority of tokens per day
Session length trend	Average minutes per session

The key question was whether the pipeline would demand different model capabilities as it progressed through its target queue, and whether the cost profile would reflect that demand.
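The metrics in the table above can be derived from per-session records with a small aggregation. The following is a minimal sketch, not the pipeline's actual code: the schema of bloomnet.db is not described here, so the `Session` record and its field names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical record shape; the real telemetry schema may differ.
    day: str                         # e.g. "2026-01-31"
    minutes: float                   # session duration in minutes
    cost_usd: float                  # API spend attributed to the session
    tokens_by_model: dict[str, int]  # tokens consumed, keyed by model name


def daily_metrics(sessions: list[Session]) -> dict[str, dict]:
    """Aggregate the five tracked metrics for each day."""
    by_day: dict[str, list[Session]] = defaultdict(list)
    for s in sessions:
        by_day[s.day].append(s)

    out: dict[str, dict] = {}
    for day, rows in sorted(by_day.items()):
        total_minutes = sum(r.minutes for r in rows)

        # Sum tokens per model across the day's sessions.
        tokens: dict[str, int] = defaultdict(int)
        for r in rows:
            for model, n in r.tokens_by_model.items():
                tokens[model] += n

        # "Dominant" means a strict majority of the day's tokens.
        dominant = None
        if tokens:
            top_model, top_tokens = max(tokens.items(), key=lambda kv: kv[1])
            if top_tokens * 2 > sum(tokens.values()):
                dominant = top_model

        out[day] = {
            "sessions": len(rows),
            "minutes": total_minutes,
            "cost_usd": sum(r.cost_usd for r in rows),
            "avg_minutes": total_minutes / len(rows),
            "dominant_model": dominant,
        }
    return out
```

A day with no single model above half of the tokens reports `None` for the dominant model, which keeps mixed days from being mislabeled.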

Results

Date	Sessions	Minutes	Cost	Avg min/session	Dominant model	Character
Jan 28	2	569	$3.09	284.5	Cheap (Sonnet)	Inaugural run, mostly fetch-wait
Jan 29	~100+	~250	~$25	~2.4	Sonnet	Peak throughput sweep
Jan 30	104	250	~$30	2.4	Sonnet	Continued breadth coverage
Jan 31	48	205	$48.02	4.27	Opus 4.5	Deep reasoning, 77% longer sessions

The model-swap was not pre-planned but emerged from the target queue. Sonnet exhausted the easy targets (those requiring primarily information retrieval and light summarization) by Jan 30. The remaining targets required deeper reasoning: cross-referencing multiple sources, evaluating contradictory signals, producing structured assessments. With no routing logic in the pipeline, the operator switched the configured model to Opus for these targets on Jan 31.

The cost jump from $3.09 to $48.02 (roughly 15.5x) came while total session time fell from 569 minutes of background runs on Jan 28 to 205 minutes across 48 sessions averaging 4.27 minutes each on Jan 31. That gap reflects the fundamental cost difference between “pipeline waiting on fetches” and “model doing hard reasoning.”
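The ratios quoted in this section can be rechecked directly from the Results table. The sketch below uses only figures that appear in that table; the variable and function names are illustrative.

```python
# Figures copied from the Results table (Jan 29 is omitted because its values are approximate).
jan28 = {"sessions": 2, "minutes": 569, "cost_usd": 3.09}
jan30 = {"sessions": 104, "minutes": 250}
jan31 = {"sessions": 48, "minutes": 205, "cost_usd": 48.02}


def avg_session_minutes(day: dict) -> float:
    """Average minutes per session for one day."""
    return day["minutes"] / day["sessions"]


def cost_per_minute(day: dict) -> float:
    """API spend per minute of session time for one day."""
    return day["cost_usd"] / day["minutes"]


# How much more Jan 31 cost than Jan 28.
cost_ratio = jan31["cost_usd"] / jan28["cost_usd"]

# Relative growth in average session length from Jan 30 to Jan 31.
length_growth = avg_session_minutes(jan31) / avg_session_minutes(jan30) - 1

print(f"cost ratio Jan 31 / Jan 28: {cost_ratio:.2f}x")
print(f"cost per minute, Jan 28:    ${cost_per_minute(jan28):.4f}")
print(f"cost per minute, Jan 31:    ${cost_per_minute(jan31):.4f}")
print(f"avg session length growth:  {length_growth:.1%}")
```

With the table's figures, the cost ratio comes out near 15.5x, the per-minute cost rises from about half a cent to about 23 cents, and the session-length growth lands between 77% and 78%, in line with the 77% quoted throughout.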

Findings

Model-swap emerges naturally from target complexity. No explicit routing logic was needed. The operator recognized that Sonnet was producing shallow results on complex targets and switched to Opus. A production system could automate this with a complexity classifier, but manual switching worked for the initial run.
Cost is not the right optimization target. The $48.02 Jan 31 spend produced research outputs that the $3.09 Jan 28 run could not. Optimizing for cost would have kept the pipeline on Sonnet and produced breadth without depth. The right metric is actionable-output-per-dollar, not dollar-per-session.
Session length is the leading indicator of model demand. When average session length grows (2.4 min to 4.27 min, a 77% increase), the pipeline is hitting targets that require more context per call. This is the signal to consider model escalation.
Long background runs are cheap. The 569-minute inaugural run at $3.09 ($0.005/min) confirms that research pipelines dominated by fetch-and-wait are nearly free in API terms. The cost lives in the reasoning, not the orchestration.
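The session-length finding above suggests a simple day-over-day check. This is a hedged sketch rather than anything the pipeline ran: the function name and the 50% threshold are assumptions chosen for illustration, and the input series is the "Avg min/session" column from the Results table.

```python
def escalation_signal(
    avg_minutes_by_day: list[tuple[str, float]],
    threshold: float = 0.5,  # hypothetical cutoff: flag growth above 50%
) -> list[str]:
    """Return the days whose average session length grew by more than
    `threshold` (as a fraction) relative to the previous day."""
    flagged = []
    for (_, prev), (day, cur) in zip(avg_minutes_by_day, avg_minutes_by_day[1:]):
        if prev > 0 and (cur - prev) / prev > threshold:
            flagged.append(day)
    return flagged


# Average minutes per session, taken from the Results table.
series = [("Jan 28", 284.5), ("Jan 29", 2.4), ("Jan 30", 2.4), ("Jan 31", 4.27)]
print(escalation_signal(series))
```

On this series only Jan 31 is flagged. The drop from the 284.5-minute inaugural run to the 2.4-minute sweep is a decrease, so it does not trigger, which matters because the fetch-wait day would otherwise distort any absolute-length rule.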

Next Steps

The model-swap pattern observed here is a candidate for automation: a complexity classifier on the target queue could route easy targets to Sonnet and complex targets to Opus without manual intervention. Session-length growth (the 77% increase) is the trigger signal.
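One possible shape for such a classifier is a rule-based score over the three complexity traits named in Results (multiple sources to cross-reference, contradictory signals, structured assessment required). Everything in this sketch is hypothetical: the `Target` fields, the scoring rule, the threshold, and the "sonnet" and "opus" strings, which are tier labels rather than API model identifiers.

```python
from dataclasses import dataclass


@dataclass
class Target:
    # Hypothetical target description; the real queue format is not specified.
    name: str
    source_count: int                  # sources that must be cross-referenced
    has_conflicting_signals: bool      # sources disagree and need weighing
    needs_structured_assessment: bool  # output is an assessment, not a summary


def complexity_score(target: Target) -> int:
    """Count how many of the three complexity traits the target shows (0 to 3)."""
    score = 0
    if target.source_count > 1:
        score += 1
    if target.has_conflicting_signals:
        score += 1
    if target.needs_structured_assessment:
        score += 1
    return score


def route(target: Target, threshold: int = 2) -> str:
    """Send targets at or above the threshold to the deeper tier."""
    return "opus" if complexity_score(target) >= threshold else "sonnet"


easy = Target("retrieval-only", source_count=1,
              has_conflicting_signals=False, needs_structured_assessment=False)
hard = Target("conflicting-sources", source_count=4,
              has_conflicting_signals=True, needs_structured_assessment=True)
print(route(easy), route(hard))
```

A rule-based score is only a starting point. Its value is that every routing decision can be traced to a named trait, which makes it easy to compare against the manual switch made on Jan 31 before investing in a learned classifier.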

Source

Journal: journal/daily/2026-01-28: pipeline launch
Journal: journal/daily/2026-01-31: model-swap to Opus, 77% longer sessions