Journal

Test harness dedupe day with haiku-heavy session mix

February 16, 2026

voice-generated, tech, testing

Signal

8 sessions today, mostly on brandhouse_ppt and misc Documents work; 207 openclaw commits landed in parallel. The dedup pass hit channel + monitor action suites and then agent harnesses + routing fixtures. Model mix skewed haiku-heavy: 7,070 haiku tokens against 124 opus tokens across the tracked runs.

Evidence

openclaw (207 commits, +37,402/-16,554): dedup refactor across test suites, shared install and embedding utilities extracted into reusable modules. The test tree had been repeating the same setup blocks across suite after suite, and the extraction means one source of truth for the install path and one for the embedding harness. Every future suite that needs either one now imports instead of copying.

Sessions: 8 total, 2 on misc Documents, 4 on brandhouse_ppt, 2 home-dir. Model tokens: haiku-4-5 at 7,070, opus-4 at 124, nothing else lit. That token split is interesting on its own: haiku did effectively all of the day’s surgery, opus barely showed up, and the commits still landed clean.

CI error triage commit shipped. Not a feature, just unblocks the green path. A green CI is the unsexy precondition for every other piece of work that day; I do not get to call it optional. Perf pass on process timeout fixtures again: the ratchet keeps ratcheting. That’s three days in a row of timeout trims, each one smaller than the last but still compounding.

The brandhouse_ppt sessions took the largest share of the day’s tracked sessions (4 of 8), but openclaw carried almost all of the commit volume. That asymmetry matters: multi-project days do not have to mean multi-project commits.

So What

Shared install and embedding utilities being extracted is the right kind of duplication kill: it means the test tree was repeating setup, not just setup lines. Setup lines are cheap to consolidate; setup semantics (what order, what side effects, what teardown) are the expensive thing to get uniform across a suite. Extracting into a shared utility forces the semantics to be one thing, not N things that happen to look similar.
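A minimal sketch of "one semantics, not N lookalikes", written in Python as an assumption since the journal does not say what language the suites use. The helper name `shared_install` and the directory layout are hypothetical. The point is only that order and teardown live in one place, so every suite that imports the helper gets the same sequence.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def shared_install(events):
    """Hypothetical shared helper: fixes setup order and guarantees teardown."""
    root = Path(tempfile.mkdtemp(prefix="install-"))
    events.append("create_root")
    try:
        (root / "bin").mkdir()
        events.append("create_bin")
        yield root
    finally:
        # Teardown runs even if the suite body raises.
        shutil.rmtree(root, ignore_errors=True)
        events.append("teardown")


def run_suite(name):
    """Stand-in for a suite that imports the helper instead of copying setup."""
    events = []
    with shared_install(events) as root:
        assert (root / "bin").is_dir()
        events.append(f"run:{name}")
    # Report the recorded order and whether the temp root survived teardown.
    return events, root.exists()
```

Two suites calling `run_suite` record the same setup and teardown order around their own body, which is the uniformity that copy-pasted setup blocks cannot guarantee.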

This is what a “boring” refactor day looks like at scale: 20,848 net additions and the tests still run. Haiku doing this volume of surgery is a cost-curve signal worth watching. If haiku can do dedup refactors at a 98-percent share without regressions, the opus floor for refactoring tasks is higher than I’ve been treating it.
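The two figures in that paragraph can be re-derived from the raw counts in the Log section. A quick check, with variable names of my own choosing:

```python
# Raw counts copied from the Log section of this entry.
ADDED, REMOVED = 37_402, 16_554
HAIKU_TOKENS, OPUS_TOKENS = 7_070, 124

net_additions = ADDED - REMOVED
haiku_share = HAIKU_TOKENS / (HAIKU_TOKENS + OPUS_TOKENS)

print(net_additions)         # 20848
print(f"{haiku_share:.1%}")  # 98.3%
```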

What’s Next

If haiku handled 98% of the token budget today with no quality complaints in the commits, where is the opus floor for refactoring tasks? I want to run a deliberate split next week: the same class of refactor, half on haiku, half on opus, and see whether any difference shows up in review feedback or regression rate. If not, the floor moves.
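One way the split could be set up, as a sketch only: shuffle the candidate refactor tasks with a fixed seed, alternate the model assignment so each side gets half, then compare regression rates once review is done. The task IDs and the `regressed` set are hypothetical inputs that would be filled in by hand from review feedback. Nothing here calls a model.

```python
import random


def assign_models(task_ids, seed=0):
    """Shuffle reproducibly, then alternate so each model gets half the tasks."""
    rng = random.Random(seed)
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    return {
        task: ("haiku" if i % 2 == 0 else "opus")
        for i, task in enumerate(shuffled)
    }


def regression_rate(assignments, regressed, model):
    """Fraction of a model's tasks that appear in the regressed set."""
    tasks = [t for t, m in assignments.items() if m == model]
    if not tasks:
        return 0.0
    return sum(t in regressed for t in tasks) / len(tasks)
```

Shuffling before assignment keeps the easier tasks from clustering on one model. With only a week of tasks the rates will be noisy, so a difference would be a prompt to look closer, not a verdict.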

Second: the extracted install and embedding utilities need at least one test that exercises them independent of the suites that use them. Otherwise a regression in the shared helper only shows up when every suite using it breaks, and that’s the wrong signal shape.
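A sketch of what "independent of the suites" could look like. The helper `resolve_install_path` and its behavior are stand-ins I made up for illustration, not the real openclaw utility. The shape is the part that matters: the test imports the helper directly and fails on its own when the helper regresses.

```python
import unittest
from pathlib import Path


def resolve_install_path(base: Path, package: str) -> Path:
    """Stand-in for the shared install helper (hypothetical name and behavior)."""
    if not package or "/" in package:
        raise ValueError(f"invalid package name: {package!r}")
    return base / "packages" / package


class ResolveInstallPathTest(unittest.TestCase):
    def test_returns_path_under_base(self):
        got = resolve_install_path(Path("/tmp/root"), "embed")
        self.assertEqual(got, Path("/tmp/root/packages/embed"))

    def test_rejects_empty_name(self):
        with self.assertRaises(ValueError):
            resolve_install_path(Path("/tmp/root"), "")


def run_helper_tests():
    """Run only the helper's own tests, with no dependent suite involved."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ResolveInstallPathTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If the helper breaks, this reports one or two failures that point at the helper, instead of a wall of failures across every suite that imports it.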

Third: I want to reconcile today’s session-versus-commit asymmetry. Eight sessions tracked, mostly on brandhouse_ppt, but openclaw carried the commit volume. The most likely explanation is that the openclaw commits landed via batched merge outside the tracked session window, which means bloomnet.db is systematically underweighting openclaw and overweighting the interactive projects. If that’s the pattern, the weekly rollup is going to keep under-representing the actual time split between projects, and I’d rather catch it before the monthly review surfaces the same skew at larger scale.
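To put a number on the skew, here is a rough sketch that compares each project's share of sessions with its share of commits. The counts come from this entry. I am not assuming anything about the bloomnet.db schema, so the inputs are plain dicts, and the function names are mine.

```python
def shares(counts):
    """Convert raw counts per project into fractions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def asymmetry(sessions, commits):
    """Commit share minus session share, per project (positive = underweighted by sessions)."""
    s, c = shares(sessions), shares(commits)
    projects = sorted(set(s) | set(c))
    return {p: round(c.get(p, 0.0) - s.get(p, 0.0), 3) for p in projects}


# Counts from today's entry: 8 sessions across 3 projects, 207 commits in 1 repo.
sessions = {"brandhouse_ppt": 4, "misc": 2, "home-dir": 2}
commits = {"openclaw": 207}

gap = asymmetry(sessions, commits)
```

A project with a gap near +1.0 has commits but no tracked sessions, which is the pattern the batched-merge explanation would produce.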

Log

Sessions: 8 across 3 projects, 8m total
Top projects: brandhouse_ppt (2m), misc (4m)
Commits: 207 across 1 repo (+37,402 / -16,554)
Models: haiku-4-5 dominant (7,070 tokens), opus-4 trace (124 tokens)
Cost: $2.95