Model-Capability Harness
Eval archetype that measures raw model abilities using deterministic scoring. Think SAT for language models.
A model-capability harness measures what a language model can do in isolation, with deterministic scoring (exact match, multiple choice, loglikelihood comparison). No LLM judge, no subjective quality assessment. The canonical implementation is EleutherAI’s lm-evaluation-harness, which runs 400+ benchmarks through a 6-stage pipeline: Instance -> Request -> filter -> generate -> filter -> aggregate. Task definitions are YAML files, not code, making new benchmarks declarative.
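The scoring modes named above can be sketched in a few lines. This is a minimal illustration, not lm-evaluation-harness code: the helper names (normalize, exact_match, multiple_choice_score) and the normalization rule are assumptions made for the example.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, target: str) -> float:
    """Score 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


def pick_by_loglikelihood(choice_logprobs: Sequence[float]) -> int:
    """Return the index of the answer choice with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def multiple_choice_score(choice_logprobs: Sequence[float], gold_index: int) -> float:
    """Score 1.0 when the most likely choice is the gold answer, else 0.0."""
    return 1.0 if pick_by_loglikelihood(choice_logprobs) == gold_index else 0.0
```

Neither function calls a judge model or applies a rubric: given the same prediction (or the same per-choice log-likelihoods), the score is always the same.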
How It Works
The harness converts each benchmark task into a request type (loglikelihood, generate_until, loglikelihood_rolling, multiple_choice), dispatches the requests to the model, applies filter ensembles to normalize the outputs, and aggregates the per-example scores. The key architectural insight is that decoupling task specification from model execution lets the same task definition run against any model backend.
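A toy version of that flow, assuming a task expressed as plain data (standing in for a YAML file) and a hypothetical backend interface with a single generate_until method. None of these names come from lm-evaluation-harness. The sketch only shows the shape of the decoupling.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Request:
    request_type: str  # e.g. "generate_until"
    prompt: str
    target: str


class ModelBackend(Protocol):
    """Hypothetical backend interface: anything with generate_until works."""

    def generate_until(self, prompt: str) -> str: ...


class ToyAdditionModel:
    """Stand-in backend that answers 'a + b' prompts in a full sentence."""

    def generate_until(self, prompt: str) -> str:
        a, b = map(int, re.findall(r"-?\d+", prompt)[:2])
        return f"The answer is {a + b}."


# Task specification as data, not code (illustrative fields only).
TASK = {
    "name": "toy_addition",
    "output_type": "generate_until",
    "prompt_template": "What is {a} + {b}?",
    "docs": [{"a": 2, "b": 3, "answer": "5"}, {"a": 10, "b": 7, "answer": "17"}],
}


def build_requests(task: dict) -> list[Request]:
    """Turn each document of the task into one model request."""
    return [
        Request(task["output_type"], task["prompt_template"].format(**doc), doc["answer"])
        for doc in task["docs"]
    ]


def extract_number(raw: str) -> str:
    """Filter step: keep only the last integer in the raw model output."""
    matches = re.findall(r"-?\d+", raw)
    return matches[-1] if matches else ""


def run_task(task: dict, model: ModelBackend) -> float:
    """Dispatch requests, filter outputs, and aggregate to mean accuracy."""
    requests = build_requests(task)
    raw_outputs = [model.generate_until(r.prompt) for r in requests]
    filtered = [extract_number(out) for out in raw_outputs]
    scores = [1.0 if out == r.target else 0.0 for out, r in zip(filtered, requests)]
    return sum(scores) / len(scores)
```

Because run_task depends only on the ModelBackend interface, swapping in a different backend requires no change to TASK.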
Signal Properties
- Deterministic: no variance between runs (same model + same prompt = same score), provided generation uses greedy decoding or a fixed seed
- Reproducible: sealed datasets with known ground truth
- Narrow: measures capability at one task, not application fitness
- Composable: individual benchmark scores combine into capability profiles
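One way to read the "composable" property is as a grouping of per-benchmark scores into named capabilities. The sketch below assumes an unweighted mean per group. The scores and the grouping are invented for illustration and are not real results.

```python
from statistics import mean

# Hypothetical scores in [0, 1]; not measurements of any real model.
SCORES = {"mmlu": 0.70, "gsm8k": 0.50, "math": 0.30, "humaneval": 0.40}

# Assumed mapping from capability to the benchmarks that probe it.
CAPABILITY_GROUPS = {
    "knowledge": ["mmlu"],
    "math": ["gsm8k", "math"],
    "code": ["humaneval"],
}


def capability_profile(scores: dict, groups: dict) -> dict:
    """Average the benchmark scores inside each capability group."""
    profile = {}
    for capability, benchmarks in groups.items():
        missing = [b for b in benchmarks if b not in scores]
        if missing:
            raise KeyError(f"no score recorded for: {missing}")
        profile[capability] = mean(scores[b] for b in benchmarks)
    return profile
```

An unweighted mean treats every benchmark in a group as equally informative, which is a simplification. Weighting by dataset size or difficulty would be an equally defensible choice.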
When to Use
Use when you need to compare raw model performance, validate that a fine-tune didn’t regress, or select between models for a specific capability (coding, math, reasoning). It fits best when each item has a known ground truth and “correct” does not depend on context.
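The fine-tune regression check can be sketched as a comparison of two score tables. The function name, the example scores, and the 0.01 tolerance are all assumptions. The tolerance is a policy choice, not a value taken from any tool.

```python
def find_regressions(base: dict, tuned: dict, tolerance: float = 0.01) -> dict:
    """Return benchmarks where the tuned model scores more than `tolerance`
    below the base model, mapped to the (negative) score change.

    Only benchmarks present in both tables are compared.
    """
    regressions = {}
    for name in sorted(base.keys() & tuned.keys()):
        delta = tuned[name] - base[name]
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Hypothetical scores: the fine-tune gains on math but loses on knowledge.
base_scores = {"gsm8k": 0.50, "mmlu": 0.70}
tuned_scores = {"gsm8k": 0.55, "mmlu": 0.65}
flagged = find_regressions(base_scores, tuned_scores)  # flags only "mmlu"
```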
When to Avoid
Avoid for application-level quality assessment (use application-quality harness) or adversarial discrimination testing (use falsification harness). Also treat saturated benchmarks with caution: as models improve, scores bunch near the ceiling, so a 99% MMLU score discriminates less between models than a 70% score once did.
Exemplar
lm-evaluation-harness (EleutherAI): 400+ tasks, YAML-as-task-definition, 6-stage request pipeline, 4 request types, filter ensembles. See research/2026-04-27-lm-evaluation-harness-architecture.
Vault Instance
Oil model’s LOO recalibration: 7 parameters ratcheted against deterministic prediction-accuracy metrics (R-squared, MAPE). The metric is deterministic and sealed (historical price data), matching the model-capability archetype.
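For reference, the two metrics named here can be computed as follows. This is a generic sketch of R-squared and MAPE using their standard definitions. It is not the oil model's recalibration code, and the numbers in the usage lines are made up rather than historical prices.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError if every actual value is identical.
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Raises ZeroDivisionError if any actual value is zero.
    """
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only.
actual_prices = [100.0, 200.0]
predicted_prices = [110.0, 180.0]
fit = r_squared(actual_prices, predicted_prices)
error_pct = mape(actual_prices, predicted_prices)
```

Both functions are pure arithmetic over fixed data, which is what makes the metric deterministic and sealed in the sense used above.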
Key Benchmarks
MMLU (knowledge), HumanEval (code), GSM8K (math), ARC (reasoning), HellaSwag (commonsense), TruthfulQA (factuality), MATH (competition math), BBH (BIG-Bench Hard: difficult multi-step reasoning), WinoGrande (coreference). All available via lm-evaluation-harness.
Related
- definitions/eval-harness-archetype: parent taxonomy
- definitions/application-quality-harness: contrast: measures application output, not model capability
- definitions/falsification-harness: contrast: measures discrimination, not production
- research/2026-04-27-lm-evaluation-harness-architecture: deep-dive on the architecture