Definition

Model-Capability Harness

Eval archetype that measures raw model abilities using deterministic scoring. Think SAT for language models.

definition, ai-agents, evaluation, benchmarks

A model-capability harness measures what a language model can do in isolation, using deterministic scoring (exact match, multiple choice, loglikelihood comparison). There is no LLM judge and no subjective quality assessment. The canonical implementation is EleutherAI’s lm-evaluation-harness, which runs 400+ benchmarks through a 6-stage pipeline: Instance -> Request -> filter -> generate -> filter -> aggregate. Task definitions are YAML files rather than code, so adding a new benchmark is largely declarative.
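To make "deterministic scoring" concrete, here is a minimal sketch of two of the scoring rules named above: exact match and multiple choice by loglikelihood comparison. The function names and the normalization rule are illustrative assumptions, not lm-evaluation-harness APIs.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization rule)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 only if the normalized strings are identical."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def pick_by_loglikelihood(choice_logprobs: Sequence[float]) -> int:
    """Return the index of the most likely choice; ties go to the lowest index."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def multiple_choice_score(choice_logprobs: Sequence[float], gold_index: int) -> float:
    """Score 1.0 if the most likely choice is the gold answer."""
    return 1.0 if pick_by_loglikelihood(choice_logprobs) == gold_index else 0.0
```

Both functions are pure: the same model outputs always produce the same score, which is the property that separates this archetype from judge-based evaluation.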

How It Works

The harness converts each benchmark task into requests of one of four types (loglikelihood, generate_until, loglikelihood_rolling, multiple_choice), dispatches them to the model, applies filter ensembles to normalize the outputs, and aggregates the scores. A multiple_choice task is scored by comparing the loglikelihood of each answer choice. Key architectural insight: because task specification is decoupled from model execution, the same task definition works across model backends.
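The flow described above can be sketched as a toy harness. Everything here is hypothetical (the `TASK` dictionary, `Request`, `run_task`, and both stand-in backends); it only illustrates how a declarative task spec stays unchanged while the model backend is swapped, and it is not how lm-evaluation-harness is implemented.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical declarative task spec: plain data, no task-specific code.
TASK = {
    "name": "toy_arithmetic",
    "output_type": "generate_until",
    "prompt_template": "Q: {question}\nA:",
    "until": ["\n"],
    "docs": [
        {"question": "2 + 2", "answer": "4"},
        {"question": "3 * 5", "answer": "15"},
    ],
}


@dataclass
class Request:
    request_type: str
    prompt: str
    until: List[str]
    target: str


# A backend is any callable that maps a request to raw model output.
Backend = Callable[[Request], str]


def build_requests(task: Dict) -> List[Request]:
    """Turn each document in the task spec into one request."""
    return [
        Request(task["output_type"], task["prompt_template"].format(**doc),
                task["until"], doc["answer"])
        for doc in task["docs"]
    ]


def strip_filter(raw: str) -> str:
    """Normalize raw output before scoring (a stand-in for a filter ensemble)."""
    return raw.strip()


def run_task(task: Dict, backend: Backend) -> float:
    """Dispatch requests, filter outputs, score by exact match, and aggregate."""
    requests = build_requests(task)
    filtered = [strip_filter(backend(request)) for request in requests]
    scores = [1.0 if out == req.target else 0.0 for out, req in zip(filtered, requests)]
    return mean(scores)


def calculator_backend(request: Request) -> str:
    """Stand-in model that evaluates the arithmetic in the prompt."""
    expression = request.prompt.split("Q: ")[1].split("\n")[0]
    left, op, right = expression.split()
    value = int(left) + int(right) if op == "+" else int(left) * int(right)
    return f" {value}\n"


def constant_backend(request: Request) -> str:
    """Stand-in model that always answers 4."""
    return " 4\n"
```

Calling `run_task(TASK, calculator_backend)` and `run_task(TASK, constant_backend)` scores two different backends against the identical task definition, which is the decoupling the paragraph describes.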

Signal Properties

  • Deterministic: scoring adds no variance between runs (same model + same prompt + same decoding settings = same score)
  • Reproducible: sealed datasets with known ground truth
  • Narrow: measures capability at one task, not application fitness
  • Composable: individual benchmark scores combine into capability profiles
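As an illustration of the "composable" property, the sketch below rolls per-benchmark scores up into a capability profile. The grouping, the unweighted average, and the score values are all assumptions for illustration; the numbers are placeholders, not measured results.

```python
from statistics import mean
from typing import Dict, List


def capability_profile(scores: Dict[str, float],
                       groups: Dict[str, List[str]]) -> Dict[str, float]:
    """Average the benchmark scores in each capability group (unweighted)."""
    profile = {}
    for capability, benchmarks in groups.items():
        missing = [name for name in benchmarks if name not in scores]
        if missing:
            raise KeyError(f"no score for: {missing}")
        profile[capability] = mean(scores[name] for name in benchmarks)
    return profile


# Placeholder values only; these are not real benchmark results.
PLACEHOLDER_SCORES = {"gsm8k": 0.5, "math": 0.25, "humaneval": 0.75}
GROUPS = {"math": ["gsm8k", "math"], "coding": ["humaneval"]}

profile = capability_profile(PLACEHOLDER_SCORES, GROUPS)
```

An unweighted mean treats every benchmark in a group as equally informative; weighting by item count or by benchmark difficulty is an equally defensible choice.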

When to Use

Use when you need to compare raw model performance, validate that a fine-tune didn’t regress, or select between models for a specific capability (coding, math, reasoning). Not appropriate for measuring application-level quality where “correct” depends on context.
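A fine-tune regression check can be as simple as comparing two score tables. The sketch below assumes scores are accuracies in [0, 1] keyed by benchmark name; `find_regressions` and the `tolerance` default are hypothetical choices, not part of any harness.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.01) -> Dict[str, float]:
    """Return benchmarks where the candidate drops more than `tolerance` below baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            raise KeyError(f"candidate has no score for {name}")
        drop = base_score - candidate[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

Because scoring is deterministic, any drop above the tolerance reflects a change in the model rather than noise from the scorer; the tolerance only needs to absorb differences you have decided not to care about.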

When to Avoid

Not appropriate for application-level quality assessment (use an application-quality harness) or adversarial discrimination testing (use a falsification harness). Benchmarks also saturate as models improve: a 99% MMLU score tells you less than a 70% score did, because near the ceiling there is little headroom left to separate models.

Exemplar

lm-evaluation-harness (EleutherAI): 400+ tasks, YAML-as-task-definition, 6-stage request pipeline, 4 request types, filter ensembles. See research/2026-04-27-lm-evaluation-harness-architecture.

Vault Instance

Oil model’s LOO recalibration: 7 parameters ratcheted against deterministic prediction-accuracy metrics (R-squared, MAPE). The metric is deterministic and sealed (historical price data), matching the model-capability archetype.
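For reference, the two metrics named above can be computed as follows. This is a minimal sketch using the standard definitions of R-squared and MAPE; it does not reproduce the oil model or its recalibration loop, and any inputs passed to it here are illustrative.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

Both are pure functions of the sealed historical data and the model's predictions, so a parameter change either improves the number or it does not, which is what makes ratcheting against them workable.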

Key Benchmarks

MMLU (knowledge), HumanEval (code), GSM8K (math), ARC (reasoning), HellaSwag (commonsense), TruthfulQA (factuality), MATH (competition math), BBH (BIG-Bench Hard, challenging reasoning tasks), WinoGrande (coreference). All available via lm-evaluation-harness.