Definition

Model-Capability Harness

Eval archetype that measures raw model abilities using deterministic scoring. Think SAT for language models.

April 26, 2026

definition, ai-agents, evaluation, benchmarks

A model-capability harness measures what a language model can do in isolation, using deterministic scoring such as exact match, multiple-choice accuracy, or loglikelihood comparison. It uses no LLM judge and makes no subjective quality assessment. The canonical implementation is EleutherAI’s lm-evaluation-harness, which runs 400+ benchmarks through a 6-stage pipeline: Instance -> Request -> filter -> generate -> filter -> aggregate. Task definitions are YAML files rather than code, so adding a new benchmark is largely declarative.
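To make "deterministic scoring" concrete, here is a minimal sketch of two such scorers. The function names and the normalization rule are assumptions chosen for illustration; they are not the lm-evaluation-harness API.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, target: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


def multiple_choice_score(choice_logprobs: list[float], gold_index: int) -> float:
    """Score 1.0 when the highest-loglikelihood choice is the gold answer."""
    # Hypothetical input: one summed log-probability per answer choice.
    picked = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return 1.0 if picked == gold_index else 0.0
```

Both scorers are pure functions of the model output and the ground truth, which is why no judge model is needed.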

How It Works

The harness converts each benchmark task into requests according to its output type (loglikelihood, generate_until, loglikelihood_rolling, or multiple_choice, which is scored by comparing loglikelihoods across the answer choices), dispatches them to the model, applies filter ensembles to normalize outputs, and aggregates scores. The key architectural insight is that decoupling task specification from model execution lets the same task definition work across any model backend.
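The decoupling described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `Task`, `Backend`, `AdderBackend`, `AlwaysSevenBackend`, and `evaluate` are hypothetical names, and the two backends are stand-ins for real models, not part of lm-evaluation-harness.

```python
import re
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    """Declarative task spec: data plus pure functions, no model code."""
    name: str
    docs: tuple
    build_prompt: Callable[[dict], str]
    output_filter: Callable[[str], str]
    metric: Callable[[str, str], float]


class Backend(Protocol):
    def generate_until(self, prompt: str, stop: str) -> str: ...


class AdderBackend:
    """Toy model that actually solves 'a + b' prompts."""
    def generate_until(self, prompt: str, stop: str) -> str:
        question = prompt.removeprefix("Q: ").split("\n")[0]
        a, b = (int(part) for part in question.split("+"))
        return f" The answer is {a + b}."


class AlwaysSevenBackend:
    """Toy model that ignores the prompt."""
    def generate_until(self, prompt: str, stop: str) -> str:
        return " 7"


def last_integer(text: str) -> str:
    """Filter step: keep only the last integer in the raw output."""
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else ""


def exact(prediction: str, target: str) -> float:
    return 1.0 if prediction == target else 0.0


def evaluate(task: Task, backend: Backend) -> float:
    scores = []
    for doc in task.docs:
        prompt = task.build_prompt(doc)                    # doc -> request
        raw = backend.generate_until(prompt, stop="\n")    # model execution
        answer = task.output_filter(raw.split("\n")[0])    # normalize output
        scores.append(task.metric(answer, doc["answer"]))  # per-doc score
    return sum(scores) / len(scores)                       # aggregate


ADDITION = Task(
    name="toy_addition",
    docs=({"question": "2 + 3", "answer": "5"},
          {"question": "3 + 4", "answer": "7"}),
    build_prompt=lambda doc: f"Q: {doc['question']}\nA:",
    output_filter=last_integer,
    metric=exact,
)
```

Calling `evaluate(ADDITION, AdderBackend())` and `evaluate(ADDITION, AlwaysSevenBackend())` scores two different backends against the same unchanged task definition, which is the point of the separation.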

Signal Properties

Deterministic: the scorer adds no variance between runs (same model + same prompt + same decoding settings = same score)
Reproducible: sealed datasets with known ground truth
Narrow: measures capability at one task, not application fitness
Composable: individual benchmark scores combine into capability profiles
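As one possible reading of the "composable" property, the sketch below averages per-benchmark scores into a per-capability profile. The benchmark names, the category mapping, and the unweighted mean are all assumptions made for illustration; a real profile might weight benchmarks differently.

```python
from collections import defaultdict
from statistics import mean


def capability_profile(scores: dict[str, float],
                       categories: dict[str, str]) -> dict[str, float]:
    """Group benchmark scores by capability and average each group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for benchmark, score in scores.items():
        grouped[categories[benchmark]].append(score)
    return {capability: round(mean(values), 4)
            for capability, values in grouped.items()}


# Placeholder values for illustration only, not real benchmark results.
example_scores = {"math_a": 0.6, "math_b": 0.8, "code_a": 0.5}
example_categories = {"math_a": "math", "math_b": "math", "code_a": "coding"}
profile = capability_profile(example_scores, example_categories)
```

Because each input score is narrow and reproducible, the profile stays comparable across models as long as the same benchmarks and mapping are used.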

When to Use

Use when you need to compare raw model performance, validate that a fine-tune didn’t regress, or select between models for a specific capability (coding, math, reasoning). Not appropriate for measuring application-level quality where “correct” depends on context.
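The fine-tune regression check mentioned above could look like the following sketch. The function name, the benchmark names, and the 0.01 tolerance are hypothetical; the tolerance in particular is an assumption you would tune to your own benchmarks.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, float]:
    """Return benchmarks where the candidate dropped by more than tolerance."""
    regressions = {}
    for benchmark, base_score in baseline.items():
        if benchmark not in candidate:
            continue  # only compare benchmarks that both runs share
        delta = candidate[benchmark] - base_score
        if delta < -tolerance:
            regressions[benchmark] = round(delta, 4)
    return regressions


# Placeholder values for illustration only, not real benchmark results.
base_run = {"math_a": 0.62, "code_a": 0.48}
tuned_run = {"math_a": 0.55, "code_a": 0.49}
dropped = find_regressions(base_run, tuned_run)
```

Since scoring is deterministic, a drop larger than the tolerance is more likely to reflect a real change in the model than scorer noise, though small evaluation sets can still produce sampling error.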

definitions/eval-harness-archetype: parent taxonomy
definitions/application-quality-harness: contrast: measures application output, not model capability
definitions/falsification-harness: contrast: measures discrimination, not production
research/2026-04-27-lm-evaluation-harness-architecture: deep-dive on the architecture
