Experiments

Hypothesis-driven tests with systematic execution. Each experiment documents what was tried, what surprised us, and what to try next.

Confirmed (67% success rate), Refuted, Pending, Total (across 14 projects)

bloomnet (9)

Experiment confirmed bloomnet May 4, 2026

Guardrail architectures combining hard-block hooks with memory rules a

hook-hard-block + memory-rule achieves 100% reduction (N=1). Memory-rule-only achieves 33% resolution rate (1/3). 93% of

Experiment pending bloomnet Apr 23, 2026

Adding Distribution and Semantic-Builders leagues plus factoring Matur

Experiment confirmed bloomnet Apr 19, 2026

Four-league structure (Harness/Vault/Skills/Tools) with one subagent p

57 scorecards across 32 systems. Rate-limit pitfall discovered: dispatching 53 subagents in one wave produced 0/53 score

Experiment confirmed bloomnet Apr 7, 2026

Building our own frame-graph storage in SQLite with tantivy BM25 and H

Approach C delivered: 711 frames indexed, hybrid search working, 72 scanners running, 86% audit pass rate on first run.

Experiment pending bloomnet Apr 3, 2026

Applying the Karpathy LLM Knowledge Base compilation pattern to a subs

Experiment pending bloomnet Apr 3, 2026

A rolling z-score anomaly detector (2-sigma threshold) on Claude Code
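A minimal sketch of what a rolling z-score detector with a 2-sigma threshold could look like. The monitored metric is cut off in the entry above, so the input series, the window size, and the function name `rollingZScoreAnomalies` are all assumptions made for illustration, not the experiment's actual design.

```typescript
// Hypothetical sketch: flag points that deviate more than `threshold`
// standard deviations from the mean of the preceding `window` points.
function rollingZScoreAnomalies(
  values: number[],
  window: number,
  threshold: number = 2, // 2-sigma, as stated in the hypothesis
): number[] {
  const anomalies: number[] = [];
  for (let i = window; i < values.length; i++) {
    // Baseline is the trailing window, excluding the point under test.
    const slice = values.slice(i - window, i);
    const mean = slice.reduce((a, b) => a + b, 0) / window;
    const variance =
      slice.reduce((a, b) => a + (b - mean) ** 2, 0) / window;
    const std = Math.sqrt(variance);
    if (std === 0) continue; // flat baseline: z-score is undefined, skip
    const z = (values[i] - mean) / std;
    if (Math.abs(z) > threshold) anomalies.push(i);
  }
  return anomalies;
}

// Illustrative series (made-up numbers): one obvious spike at index 7.
const series = [10, 11, 9, 10, 11, 9, 10, 50, 10];
const flagged = rollingZScoreAnomalies(series, 5);
```

Excluding the current point from its own baseline keeps a single spike from inflating the standard deviation it is judged against. The spike still widens the window for the points that follow it.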

Experiment confirmed bloomnet Mar 24, 2026

Puppeteer-driven screenshot capture loops will catch visual regression

Screenshot automation catches rendering bugs that unit tests miss. 2x retina resolution reveals sub-pixel issues. Rollin

Experiment confirmed bloomnet Mar 19, 2026

A 4-source ingestion pipeline (JSONL sessions, stats-cache, history.js

Pipeline operational with all 4 sources. Sub-project merging eliminates double-counting. Proportional distribution preve

Experiment confirmed bloomnet Mar 14, 2026

Canvas 2D rendering will provide better performance and simpler archit

Canvas 2D eliminated WebGL context issues, simplified the rendering pipeline, and enabled L-system botanical encoding th

jobs-apply (23)

Experiment inconclusive jobs-apply Apr 27, 2026

Deploying 10 targeted fixes across easy-apply.ts, linkedin-adapter.ts,

6/6 = 100% after select fix deployed. 9/13 total for the day (3/7 before fixes, 6/6 after). From Run 1 (40%) to Run 10 (

Experiment confirmed jobs-apply Apr 1, 2026

Proactive modal scrolling and CDP select placeholder detection will re

2/3 = 67% during quiet hours. 1 failure: 240s timeout on screening question dropdown (Databricks/Pyspark). CDP select de

Experiment confirmed jobs-apply Mar 31, 2026

Fixing silent modal failures (F40), verification false negatives (F41)

All three failure patterns addressed and deployed. F40 scroll fix prevents modal-bottom submit button from being out of

Experiment inconclusive jobs-apply Mar 28, 2026

A 6-subagent Karpathy ratchet targeting interview conversion rate can

All 6 subagents complete. LinkedIn scan revealed 0 interview signals from 105 submitted applications. All 12 recruiter m

Experiment confirmed jobs-apply Mar 28, 2026

A click-and-verify loop that checks modal state after each click strat

5/6 = 83% success. All 6 attempts opened modal on strategy 1 (first try). 1 failure: Save Application dialog blocked adv

Experiment confirmed jobs-apply Mar 28, 2026

Database-backed company intelligence with priority scoring improves Di

11 new table columns. Priority scoring operational. Direct channel submission rate stable at 77.9% (113/145). Cross-chan

Experiment confirmed jobs-apply Mar 27, 2026

Event-driven A/B testing with auto-promotion will produce measurable c

System fully operational. 18 events verified in Neon via browser test. 8 experiments running. Consent gate unified. Thre

Experiment confirmed jobs-apply Mar 25, 2026

A three-layer scoring system (semantic embeddings + structured feature

592 tests passing. Score audit table logs ALL scores including sub-70 rejections. 3,742 legacy scores backfilled from ar

Experiment confirmed jobs-apply Mar 24, 2026

A multi-iteration anti-detection suite (gaussian timing, reading simul

Account unrestricted since 2026-03-26. Run 7 (2026-03-29) achieved 83% with all anti-detection measures active. P0-P1 (i

Experiment confirmed jobs-apply Mar 21, 2026

Connecting to the user's real Chrome via CDP will be more resistant to

CDP mode disables ALL stealth scripts, relying entirely on real Chrome session. 22-38s click gaps were pure LLM inferenc

Experiment confirmed jobs-apply Mar 14, 2026

Systematic fix of individual failure points will drive LinkedIn Easy A

7 runs, 40% to 83% final. 26 individual failure fixes (F1-F39). Account restriction in Run 6 was the critical learning.

Experiment confirmed jobs-apply Mar 8, 2026

Deploying 4 parallel Claude Code agents in isolated git worktrees can

679 sessions (522 main + 157 worktree) in 2 days. Agents successfully parallelized adapter work. Merged into baseline co

Experiment confirmed jobs-apply Feb 26, 2026

A provider-transparent rate limiter keyed by API key + provider can pr

Zero consumer changes needed. Rate limiting invisible inside provider layer. Prevented cost blowouts during early develo
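One way a limiter keyed by provider and API key can stay invisible to consumers is to wrap the provider call itself. The sketch below uses a token bucket. The class name `KeyedRateLimiter`, the helper `withRateLimit`, and the capacity and refill values are hypothetical, and the entry does not say which limiting algorithm the project actually used.

```typescript
// Hypothetical sketch: token-bucket limiter keyed by provider + API key.
interface Bucket {
  tokens: number;
  lastRefill: number; // milliseconds
}

class KeyedRateLimiter {
  private buckets = new Map<string, Bucket>();
  private capacity: number;
  private refillPerSecond: number;
  private now: () => number;

  constructor(
    capacity: number,
    refillPerSecond: number,
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
  }

  // Returns true and consumes a token if the call may proceed.
  tryAcquire(provider: string, apiKey: string): boolean {
    const key = `${provider}:${apiKey}`;
    const t = this.now();
    const bucket =
      this.buckets.get(key) ?? { tokens: this.capacity, lastRefill: t };
    const elapsedSec = (t - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSec * this.refillPerSecond,
    );
    bucket.lastRefill = t;
    this.buckets.set(key, bucket);
    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}

// Wraps a provider call so callers keep the same signature and never
// interact with the limiter directly.
function withRateLimit<A extends unknown[], R>(
  limiter: KeyedRateLimiter,
  provider: string,
  apiKey: string,
  call: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    if (!limiter.tryAcquire(provider, apiKey)) {
      throw new Error(`rate limit exceeded for ${provider}`);
    }
    return call(...args);
  };
}
```

Because the wrapped function has the same type as the original, swapping it in at the provider layer requires no changes in calling code, which matches the "zero consumer changes" result reported above.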

Experiment confirmed jobs-apply Feb 25, 2026

Gemini Flash vision can analyze ATS page screenshots to extract form s

Vision approach worked for page understanding but was eventually replaced by DOM-based extraction for form filling. Visi

dakka (6)

Experiment confirmed dakka Apr 22, 2026

Playwright specs covering spawn/dismiss, avatars, MQI, and responsiven

16/16 pass (14 + 2 conditional skips). Two product bugs found and fixed: destroy-rebuild in MascotBar races Playwright l

Experiment pending dakka Apr 3, 2026

Merge/review time as a fraction of total session time grows superlinea

Experiment confirmed dakka Apr 1, 2026

Rewriting core orchestration in Rust (3 crates) will provide better pe

14.8k total lines of Rust. All 3 crates compile clean. Ownership model eliminates PTY race conditions by design. Phase 4

Experiment confirmed dakka Mar 28, 2026

A monorepo architecture with 4 packages (shared, gork, mork, shell) co

Full terminal fidelity achieved via PTY-based execution. XState v5 state machines manage agent lifecycle. WebSocket bina

Experiment confirmed dakka Mar 28, 2026

+/- buttons with fixed 4-role spawn order will provide predictable war

Fixed spawn order eliminates race conditions. Users can dynamically scale from 1-4 agents. Each role has distinct capabi

Experiment confirmed dakka Mar 9, 2026

A fixed spawn order with 4 roles (orchestrator, observer, worker, spec

Fixed spawn order eliminates race conditions in process initialization. WS protocol enables real-time status updates. UI

media-diet (2)

Experiment pending media-diet Apr 16, 2026

The Meta Graph API plus a Google Takeout style archive import can prod

Experiment pending media-diet Apr 16, 2026

The YouTube Data API combined with the Google Takeout watch-history.js

apple-photos (1)

Experiment pending apple-photos Apr 13, 2026

Apple's pre-computed ML metadata (scenes, aesthetics, OCR) on personal

oil (7)

Experiment confirmed oil Apr 13, 2026

Replacing uniform Brier reference ($90-$220) with market-implied midpo

260414b final. Three-layer fix: (1) Calibration-proper scoring eliminated Goodhart gaming (calPen 0.216->0.060). (2) Pro

Experiment confirmed oil Mar 17, 2026

A 3-round quantitative audit with one-step-ahead forecasting, multi-se

R-squared=0.9752, MAPE=1.49%, OOS R-squared=0.9755. One-step-ahead: MAE=$0.833, RMSE=$1.198, direction accuracy=93.8%. B

Experiment confirmed oil Mar 17, 2026

An 11-trigger sell model with conviction hold override plus refined de

Lowest one-step MAE: $0.833 (13% better than v14b). Direction accuracy 93.8% (15/16). Sell model: 100% trigger accuracy,

Experiment confirmed oil Mar 17, 2026

An 8-layer geopolitical consensus pipeline with tier-based parameter c

v17.2 current. Cycle 3 example: R² 0.9610→0.9769 (+0.0159), MAPE 1.85%→1.26% (-0.59pp). 12 parameters accepted, 3 reject

Experiment confirmed oil Mar 16, 2026

Autoregressive error correction and spike reversion mechanics will eli

R² 0.9684 (+0.68pp). MAPE 1.57% (-0.32pp). One-step MAE $1.09 (-$0.54, 33% improvement). Direction accuracy 86.7%→93.3%

Experiment confirmed oil Mar 16, 2026

Sigmoid probability blending (vs binary threshold) combined with expan

MAE $0.951 (42% improvement from v13's $1.633). Autocorrelation 0.1651→-0.001 (eliminated). OOS R² 0.9699. P($150) eleva

Experiment confirmed oil Mar 15, 2026

Tail risk is vastly underestimated: adding historical analogues and de

R² 0.9498→0.9616 (+1.18pp). MAPE 1.97%→1.89%. OOS R² 0.8633→0.9131 (+4.98pp). P($120+) 15.4%→36.4%. Historical pass rate

email-voice (1)

Experiment pending email-voice Apr 12, 2026

Email voice is distinct from social-media voice. The same person write

brand-voice (1)

Experiment confirmed brand-voice Apr 11, 2026

A computable voice profile distilled from a decade of tweets can produ

Voice profile distilled 8 measurable dimensions from 1,149 engagement-filtered tweets. Multi-agent pipeline (3 Sonnet ge

peon-notify (6)

Experiment pending peon-notify Apr 3, 2026

Scaling codeguard timeout dynamically based on file count and complexi

Experiment pending peon-notify Apr 3, 2026

A ShellCheck + custom grep linter pass on all peon-notify hook scripts

Experiment pending peon-notify Apr 3, 2026

A three-stage retrieval pipeline (BM25 keyword filter -> embedding rer

Experiment confirmed peon-notify Mar 27, 2026

Normalizing vault paths in peon.json and fixing the _obsidian_dir() he

Orphaned notes eliminated. All session-generated notes now land in correct vault directories. The _obsidian_dir() mappin

Experiment confirmed peon-notify Mar 24, 2026

Scanning ~/.claude/projects/*.jsonl and indexing historical sessions w

286 sessions indexed from JSONL backfill. Cursor-based processing enables incremental updates without reprocessing. Sess

Experiment confirmed peon-notify Mar 19, 2026

Automated lint+review on every file write will catch issues before the

Per-write processing was too expensive (confirmed hypothesis partially wrong). Accumulate-then-flush pattern solved it:

context-curator (1)

Experiment confirmed context-curator Apr 1, 2026

A version-pinned sync mechanism between vault skills and mounted SKILL

First sync remediated 27 skills: 24 version drift (mounted had no version field), 3 unmounted (vault-only), 1 orphan (bl

Kiro CLI Factory (1)

Experiment confirmed Kiro CLI Factory Feb 28, 2026

A bash loop invoking Claude Code per-spec could autonomously build a m

Autonomous spec-driven build works but requires single-spec scoping, error tolerance (no set -e), and escape hatches for

investor-research (1)

Experiment confirmed investor-research Jan 27, 2026

A research pipeline will naturally demand model escalation as easy tar

Jan 28: 569 min, $3.09, cheap-model territory (long background fetches, minimal reasoning). Jan 29-30: session count pea

openclaw (2)

Experiment confirmed openclaw Jan 14, 2026

The plugin architecture and module boundaries from the Jan 11 migratio

PR #991 merged same-cycle with zero core patches required. The Zalo messaging adapter: a protocol most North American de

Experiment confirmed openclaw Jan 9, 2026

An OpenAI chat-completions compatible HTTP endpoint can serve as openc

308 commits in 60 minutes across 4 sessions ($2.76). The gateway shipped and was immediately exercised by the sandbox ex

OpenClaw (3)

Experiment confirmed OpenClaw Dec 9, 2025

A native macOS desktop app wrapping WhatsApp Web will be more reliable

Native app achieved 99.2% uptime over 14 days vs 91.4% for headless. Eliminated 3 failure classes: GPU rendering crashes

Experiment confirmed OpenClaw Nov 27, 2025

Periodic heartbeat checks against the paired phone will predict sessio

Heartbeat detected phone-offline state an average of 6.2 hours before token expiration. Zero false negatives over 21-day

Experiment confirmed OpenClaw Nov 14, 2025

Automatic session recovery after WhatsApp disconnection will reduce ma

Auto-recovery handled 94% of disconnections without human intervention. Remaining 6% were auth token expirations requiri

jobs-apply (23)

ToFu visitors on About are ready for BoFu action without more mid-funn

Changelog readers who see shipping momentum respond to urgency framing

Contact page visitors respond better to direct download than gentle fu

Concrete metrics build more credibility than broad feature claims

More frequent CTAs mid-scroll catch visitors before bounce, shorter pa

Leading with download CTA converts higher-intent visitors who'd bounce

Lower price anchor reduces sticker shock, action-oriented copy outperf

Trust-primed visitors who read the full security page are ready to con

Extracting the gaussian behavioral timing module from jobs-apply into
