Journal

Karpathy ratchet: empirical iteration beats upfront tuning without ground truth

March 26, 2026

data-eng, career, ai-agents, reconstructed-from-memory

2026-03-27

Signal

The Karpathy ratchet approach to data quality optimization (choosing empirical iteration over upfront parameter tuning) is the right choice when ground truth is unavailable, because it turns the absence of ground truth from a blocker into a constraint that guides the methodology.

Evidence

Project: internal audit. Two design documents created: AGENTIC-DATA-QUALITY-OPTIMIZATION.md (10-iteration audit from SWE/DE/DS perspectives) and IMPLEMENTATION-PLAN.md (25 identified weaknesses, fixes from 5 perspectives)
Pipeline weaknesses identified: no ground truth dataset, no regression tests, static domain config, fiat weights (weights set by opinion, not empirical evidence)
Choice made: Karpathy ratchet, meaning empirical optimization with a metric that only goes up; P0 is golden dataset + evaluator + ratchet config extraction
Project: projects/jobs-apply/_index. Gmail OAuth setup complete: Google Cloud project configured, desktop OAuth flow, AES-256-GCM encrypted token storage, scopes gmail.readonly + gmail.send, tokens in oauth_tokens SQLite table with auto-refresh

So What (Why Should You Care)

The Karpathy ratchet is a methodology for systems where you can measure quality but can’t define ground truth in advance. Traditional supervised ML uses labeled datasets: you know the right answer for each training example, and you measure how close your model gets. But in data quality auditing, you rarely have labels. There’s no gold standard dataset of “these events are definitely high quality and these are definitely not.” You have a scoring function and you have its outputs, and you need to optimize the scoring function without being able to check it against known-correct answers.

The ratchet approach reframes this as a constraint rather than a blocker. Define a metric that functions as a reliable proxy for quality: one where improvements in the metric correspond to genuine improvements in the underlying thing you care about. Then make the accepted value of that metric monotonic: it can only go up, never regress. Run experiments, keep changes that improve the metric, discard changes that don’t. Over many iterations, you accumulate genuine improvements instead of going in circles.
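The keep-or-discard loop described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Config`, `Evaluate`, and `Mutate` types, the `ratchet` function, and the toy metric are all hypothetical names introduced here.

```typescript
// Hypothetical types for illustration; none of these names come from the project.
type Config = Record<string, number>;
type Evaluate = (config: Config) => number; // proxy metric, higher is better
type Mutate = (config: Config) => Config;   // one experimental change

interface RatchetResult {
  best: Config;
  bestScore: number;
  history: number[]; // accepted score after each experiment; never decreases
}

function ratchet(initial: Config, evaluate: Evaluate, experiments: Mutate[]): RatchetResult {
  let best = initial;
  let bestScore = evaluate(initial);
  const history: number[] = [bestScore];

  for (const mutate of experiments) {
    const candidate = mutate(best);
    const score = evaluate(candidate);
    // Keep the change only if it strictly improves the metric; otherwise discard it.
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
    history.push(bestScore);
  }
  return { best, bestScore, history };
}

// Toy metric (an assumption for the demo): highest when threshold === 7.
const toyEvaluate: Evaluate = (c) => 10 - Math.abs(c.threshold - 7);

const result = ratchet({ threshold: 2 }, toyEvaluate, [
  (c) => ({ ...c, threshold: c.threshold + 3 }), // 5 scores 8: kept
  (c) => ({ ...c, threshold: c.threshold + 5 }), // 10 scores 7: discarded
  (c) => ({ ...c, threshold: c.threshold + 2 }), // 7 scores 10: kept
]);

console.log(result.history); // [5, 8, 8, 10]
```

The discarded experiment leaves both the config and the recorded score untouched, which is what makes the history non-decreasing. The loop is only as good as the metric: a ratchet on a bad proxy accumulates bad changes just as reliably.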

The key design question is: what makes a metric a valid proxy? For internal audit, the DQI was initially tautological: it was a Pearson correlation computed on the same scores it was derived from, so it could never disagree with them. The rebuilt evaluator (6 objective verifiable properties) is a proxy: it measures things that are genuinely correlated with quality (event descriptions exist, dates are valid, place relevance is present) without being derived from the scores themselves.
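A property-based evaluator of this kind might look like the sketch below. It covers only the three properties named above (the other three are not listed in this entry), and the `EventRecord` field names, the ISO date format, and the pass-rate scoring are assumptions made for illustration.

```typescript
// Hypothetical record shape; field names are assumptions, not the pipeline's schema.
interface EventRecord {
  description?: string;
  date?: string;                  // assumed ISO 8601 calendar date, e.g. "2026-03-26"
  placeRelevance?: number | null;
}

// Validate a YYYY-MM-DD string by round-tripping through Date.UTC,
// which rejects impossible dates such as 2026-02-30.
function isValidIsoDate(s: string): boolean {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(s);
  if (!m) return false;
  const y = Number(m[1]);
  const mo = Number(m[2]);
  const d = Number(m[3]);
  const dt = new Date(Date.UTC(y, mo - 1, d));
  return dt.getUTCFullYear() === y && dt.getUTCMonth() === mo - 1 && dt.getUTCDate() === d;
}

type Check = (e: EventRecord) => boolean;

// Each check looks only at the record itself, never at the pipeline's own scores.
const checks: Record<string, Check> = {
  hasDescription: (e) => typeof e.description === "string" && e.description.trim().length > 0,
  hasValidDate: (e) => typeof e.date === "string" && isValidIsoDate(e.date),
  hasPlaceRelevance: (e) => typeof e.placeRelevance === "number",
};

// Fraction of (record, check) pairs that pass, in the range 0..1.
function evaluate(events: EventRecord[]): number {
  if (events.length === 0) return 0;
  const all = Object.values(checks);
  let passed = 0;
  for (const e of events) {
    for (const check of all) {
      if (check(e)) passed++;
    }
  }
  return passed / (events.length * all.length);
}

const sample: EventRecord[] = [
  { description: "Concert", date: "2026-03-26", placeRelevance: 0.8 }, // passes 3 of 3
  { description: "", date: "2026-02-30" },                             // passes 0 of 3
];
console.log(evaluate(sample)); // 0.5
```

Because none of the checks read the scores being audited, the evaluator can disagree with the scoring function, which is the property the tautological DQI lacked.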

The pipeline weaknesses identified today (no ground truth dataset, no regression tests, static domain config, fiat weights) are all symptoms of the same root cause: the system was built to produce scores, not to verify that those scores are correct. Adding the ratchet infrastructure (evaluator, golden dataset, config extraction) transforms it from a scoring system into a self-improving scoring system. That’s the architectural shift that makes the P0 work the highest-leverage work.
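As a rough sketch of how config extraction and a golden-dataset regression check fit together: the weights move out of the scoring code into data, and a small curated set gates every change. All names here (`ScoringConfig`, `GoldenCase`, `goldenAccuracy`, `assertNoRegression`), the labeled `expectedHighQuality` field, and the example weights are assumptions for illustration, not the pipeline's actual design.

```typescript
// Hypothetical config shape: weights live in data so experiments can change them.
interface ScoringConfig {
  weights: Record<string, number>;
}

type Features = Record<string, number>;

// Weighted sum over whatever weights the config declares.
function score(features: Features, config: ScoringConfig): number {
  let total = 0;
  for (const [name, weight] of Object.entries(config.weights)) {
    total += weight * (features[name] ?? 0);
  }
  return total;
}

// Assumed golden-set entry: a hand-curated record with an agreed verdict.
interface GoldenCase {
  features: Features;
  expectedHighQuality: boolean;
}

// Fraction of golden cases where the thresholded score matches the verdict.
function goldenAccuracy(cases: GoldenCase[], config: ScoringConfig, threshold: number): number {
  if (cases.length === 0) return 0;
  const correct = cases.filter(
    (c) => (score(c.features, config) >= threshold) === c.expectedHighQuality,
  ).length;
  return correct / cases.length;
}

// Regression gate: throws if a candidate config scores below the recorded baseline.
function assertNoRegression(baseline: number, current: number): void {
  if (current < baseline) {
    throw new Error(`Regression: ${current} is below baseline ${baseline}`);
  }
}

const golden: GoldenCase[] = [
  { features: { completeness: 1, freshness: 1 }, expectedHighQuality: true },
  { features: { completeness: 0, freshness: 1 }, expectedHighQuality: false },
  { features: { completeness: 1, freshness: 0 }, expectedHighQuality: true },
];

const fiatConfig: ScoringConfig = { weights: { completeness: 1, freshness: 2 } };
const tunedConfig: ScoringConfig = { weights: { completeness: 2, freshness: 1 } };

const fiatAccuracy = goldenAccuracy(golden, fiatConfig, 2);
const tunedAccuracy = goldenAccuracy(golden, tunedConfig, 2);
assertNoRegression(fiatAccuracy, tunedAccuracy); // passes: the tuned config is not worse
```

With the weights extracted, a weight change becomes an experiment that the regression gate can accept or reject, rather than an opinion baked into code.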

The Gmail OAuth setup in projects/jobs-apply/_index today is a prerequisite for closing the interview feedback loop, the same feedback loop problem identified on 2026-03-29. You can’t read Gmail responses without OAuth. You can’t close the feedback loop without reading Gmail. The Gmail setup is infrastructure work for a behavioral change: making the pipeline outcome-aware rather than activity-aware.

What’s Next

Begin Karpathy ratchet implementation: golden dataset creation, evaluator build, ratchet config extraction (starts 2026-03-28)

Log

internal audit: two design documents created in internal audit pipeline/docs/
  AGENTIC-DATA-QUALITY-OPTIMIZATION.md: 10-iteration audit from SWE/DE/DS perspectives
  IMPLEMENTATION-PLAN.md: 25 identified weaknesses, fixes from 5 perspectives
  Pipeline weaknesses: no ground truth dataset, no regression tests, static domain config, fiat weights
  Karpathy ratchet approach chosen for empirical optimization
  P0 defined: golden dataset + evaluator + ratchet config extraction
projects/jobs-apply/_index: Gmail OAuth setup
  Google Cloud project configured for Gmail API access
  Desktop OAuth flow with AES-256-GCM encrypted token storage
  Scopes: gmail.readonly + gmail.send
  Tokens stored in oauth_tokens SQLite table with auto-refresh
  Setup command: npx tsx scripts/oauth-setup.ts gmail
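
For reference, AES-256-GCM encryption of a token before it is written to storage can be sketched with Node's built-in crypto module. This is an assumption-laden illustration, not the contents of scripts/oauth-setup.ts: the `EncryptedToken` shape, the function names, the base64 encoding, and the randomly generated demo key are all hypothetical, and real key management and the SQLite write are out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical storage shape: all three fields must be stored to decrypt later.
interface EncryptedToken {
  iv: string;         // base64, 12-byte nonce, unique per encryption
  tag: string;        // base64, GCM authentication tag
  ciphertext: string; // base64
}

function encryptToken(plaintext: string, key: Buffer): EncryptedToken {
  const iv = randomBytes(12); // 96-bit nonce, the size recommended for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"), // only available after final()
    ciphertext: ciphertext.toString("base64"),
  };
}

function decryptToken(enc: EncryptedToken, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(enc.iv, "base64"));
  decipher.setAuthTag(Buffer.from(enc.tag, "base64"));
  // final() throws if the ciphertext or tag was modified.
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(enc.ciphertext, "base64")),
    decipher.final(),
  ]);
  return plaintext.toString("utf8");
}

// Demo only: a real setup would load a persistent 32-byte key from a secure location.
const demoKey = randomBytes(32);
const stored = encryptToken("example-refresh-token", demoKey);
const recovered = decryptToken(stored, demoKey);
```

GCM is authenticated encryption, so a tampered row fails loudly at decrypt time instead of yielding a corrupted token.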