Skill

j-score-v2-matching

March 30, 2026

ml, career, pattern, breakthrough

Trigger

User needs to score how well a candidate matches a job posting using multi-layer fusion scoring

Version: 260420

Changelog

260420: multiple edits

v_migrate: Changelog migrated from a table to the YYMMDD H3 format per versioning-standard rule 2 (V1.6 of the skills upgrade plan).
v6: Added license and sources per V6.1/V6.2 of the skills upgrade plan.
v1.5: Added the ## Quality Checks section per V1.5 of ~/vault/plans/2026-04-20-vault-skills-upgrade-plan.md

260403: Added Visual Enrichment section + self-improving-agent-patterns cross-reference

Three-layer fusion scoring engine that replaces a legacy single-LLM scorer. An audit found 17 weaknesses in that scorer, among them no semantic understanding, non-deterministic output, zero feedback loop, no skills graph, and no score calibration. The core insight is that no single scoring method is sufficient for job matching: deterministic features miss semantic nuance, embeddings miss structured constraints, and LLMs are expensive and unreliable on their own. Fusing all three, with a confidence-weighted LLM contribution, covers each method's blind spots while keeping cost low: the two free layers run on every job, and the LLM runs only on jobs that pass the gate.

Deployed 2026-03-26 in projects/jobs-apply/_index. Every score, including sub-70 rejections, is logged to the score_audit table to feed future calibration (topics/confidence-weighted-score-fusion) and interview-rate feedback loops (topics/interview-rate-optimization).

Layer 1: Semantic Embedder (30-50ms, $0): Xenova/all-MiniLM-L6-v2, 22MB quantized, 384-dim. Embeds the job text and the candidate text and maps their cosine similarity to 0-100. Catches semantic matches that keyword overlap misses (“ML Engineer” clusters near “AI Researcher”). Loaded as a lazy singleton; if the model fails to load, the layer falls back to a neutral 50.
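A minimal sketch of Layer 1's lazy-singleton loading, cosine scoring, and neutral fallback. Assumptions: the embedding function is injected through a `load` callback (in production it would presumably wrap the Xenova/all-MiniLM-L6-v2 model), and similarity is mapped to 0-100 by clamping negatives to 0 and scaling by 100; the exact mapping in semantic-embedder.ts is not specified here. `createSemanticScorer`, `warmup`, and `similarityToScore` are hypothetical names.

```typescript
// Hypothetical sketch of Layer 1: lazy singleton, cosine similarity, neutral fallback.
type EmbedFn = (text: string) => Promise<number[]>;

const NEUTRAL_SCORE = 50; // returned when the embedding model cannot be loaded

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed mapping: clamp similarity to [0, 1], then scale to 0-100.
function similarityToScore(cos: number): number {
  return Math.round(Math.min(1, Math.max(0, cos)) * 100);
}

function createSemanticScorer(load: () => Promise<EmbedFn>) {
  // The loader runs at most once; every caller shares the same promise.
  let embedder: Promise<EmbedFn | null> | null = null;
  const getEmbedder = (): Promise<EmbedFn | null> => {
    if (embedder === null) embedder = load().catch(() => null); // load failure -> null
    return embedder;
  };
  return {
    // Call at startup so the model load happens before the first job arrives.
    warmup: async (): Promise<boolean> => (await getEmbedder()) !== null,
    score: async (jobText: string, candidateText: string): Promise<number> => {
      const embed = await getEmbedder();
      if (embed === null) return NEUTRAL_SCORE;
      const [jobVec, candVec] = await Promise.all([embed(jobText), embed(candidateText)]);
      return similarityToScore(cosineSimilarity(jobVec, candVec));
    },
  };
}
```

Caching the rejected load as `null` means a broken model costs one failed load, after which every job gets the neutral 50 without retrying.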

Layer 2: Structured Features (<1ms, $0): Five weighted sub-dimensions: Skills overlap (0.35, taxonomy-based: exact=1.0, sibling=0.8, parent=0.6, 10 categories, 200+ skills), Experience alignment (0.20), Location compatibility (0.20, metro area grouping), Salary fit (0.15), Tech stack alignment (0.10). Returns matchedSkills[] and skillGaps[] forwarded to Layer 3.
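A minimal sketch of the Layer 2 arithmetic, using the weights and match credits listed above. Assumptions: the taxonomy is modeled as a skill-to-parent map (a sibling is a skill sharing the same parent), the skills sub-score is the mean best credit over required skills, and every sub-score is already on a 0-100 scale. `Taxonomy`, `matchCredit`, `skillsOverlap`, and `structuredScore` are hypothetical names; structured-scorer.ts may compute these differently.

```typescript
// Hypothetical sketch of Layer 2 scoring. Weights and match credits come from
// the skill description; the taxonomy shape is an assumption.
type Taxonomy = Record<string, { parent?: string }>;

const SUB_WEIGHTS = {
  skills: 0.35,
  experience: 0.2,
  location: 0.2,
  salary: 0.15,
  techStack: 0.1,
} as const;

// Best credit for one required skill: exact 1.0, sibling 0.8, parent 0.6, else 0.
function matchCredit(required: string, have: Set<string>, tax: Taxonomy): number {
  if (have.has(required)) return 1.0;
  const parent = tax[required]?.parent;
  if (parent === undefined) return 0;
  const hasSibling = Array.from(have).some((skill) => tax[skill]?.parent === parent);
  if (hasSibling) return 0.8;
  return have.has(parent) ? 0.6 : 0; // candidate lists the broader parent skill
}

// Mean credit over required skills, scaled to 0-100, plus the lists forwarded to Layer 3.
function skillsOverlap(required: string[], candidate: string[], tax: Taxonomy) {
  const have = new Set(candidate);
  const matchedSkills: string[] = [];
  const skillGaps: string[] = [];
  let total = 0;
  for (const req of required) {
    const credit = matchCredit(req, have, tax);
    total += credit;
    (credit > 0 ? matchedSkills : skillGaps).push(req);
  }
  // Assumption: a posting with no listed skills gives full credit.
  const score = required.length === 0 ? 100 : (total / required.length) * 100;
  return { score, matchedSkills, skillGaps };
}

type SubScores = { [K in keyof typeof SUB_WEIGHTS]: number }; // each on a 0-100 scale

// Weighted sum of the five sub-dimensions (weights sum to 1.0).
function structuredScore(sub: SubScores): number {
  let sum = 0;
  for (const key of Object.keys(SUB_WEIGHTS) as (keyof typeof SUB_WEIGHTS)[]) {
    sum += SUB_WEIGHTS[key] * sub[key];
  }
  return sum;
}
```

For example, a candidate who knows only pytorch, facing a posting that requires pytorch, tensorflow, and kubernetes, earns 1.0 + 0.8 + 0 for a skills sub-score of 60, with kubernetes reported as the single gap.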

Layer 3: Enhanced LLM (~2s, ~$0.001 per call): DeepSeek V3 via OpenRouter. Gated: only runs when the Layer 2 score is >= 35 (validated on 2,874 historical jobs with zero false negatives; saves ~60% of API cost). Receives Layer 2 sub-scores and skill gaps as context. Scores cultural fit, career trajectory, hidden requirements, domain nuance, and red flags. Returns score, confidence (low/medium/high), reasoning, key_matches[], and concerns[]. On failure it degrades gracefully, redistributing its weight to Layers 1 and 2.
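A sketch of the gate and of defensive parsing for the Layer 3 response. The LLM call is injected as a `callLlm` function (a stand-in for the DeepSeek V3 request via OpenRouter), and the JSON field names follow the return shape described above; `parseLlmScore` and `gatedLlmScore` are hypothetical. Returning `null` on malformed output or a failed call is one way to trigger the weight redistribution, not necessarily what enhanced-llm-scorer.ts does.

```typescript
// Hypothetical sketch of Layer 3 gating and response validation.
type Confidence = "low" | "medium" | "high";

interface LlmScore {
  score: number;
  confidence: Confidence;
  reasoning: string;
  key_matches: string[];
  concerns: string[];
}

const LLM_THRESHOLD = 35; // structured score required before spending an LLM call

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");

// Returns null for malformed output so the caller can fall back to Layers 1+2.
function parseLlmScore(raw: string): LlmScore | null {
  let obj: any;
  try {
    obj = JSON.parse(raw);
  } catch {
    return null;
  }
  if (
    typeof obj !== "object" || obj === null ||
    typeof obj.score !== "number" || obj.score < 0 || obj.score > 100 ||
    !["low", "medium", "high"].includes(obj.confidence) ||
    typeof obj.reasoning !== "string" ||
    !isStringArray(obj.key_matches) || !isStringArray(obj.concerns)
  ) {
    return null;
  }
  return {
    score: obj.score,
    confidence: obj.confidence,
    reasoning: obj.reasoning,
    key_matches: obj.key_matches,
    concerns: obj.concerns,
  };
}

async function gatedLlmScore(
  structured: number,
  callLlm: () => Promise<string>,
  threshold = LLM_THRESHOLD,
): Promise<LlmScore | null> {
  if (structured < threshold) return null; // gate: no API call below the threshold
  try {
    return parseLlmScore(await callLlm());
  } catch {
    return null; // network or API failure degrades to the no-LLM weights
  }
}
```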

Fusion: finalScore = 0.35 * L1 + 0.25 * L2 + 0.40 * L3. Confidence adjustment: the LLM weight is multiplied by 0.6 (low), 1.0 (medium), or 1.2 (high) confidence before the three weights are renormalized. No-LLM fallback: 0.55 semantic / 0.45 structured. Floor: a final score of 70 or higher is required to proceed to tailoring.
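The fusion rule can be written out as below. This is a sketch that assumes renormalization means dividing all three weights by their sum after the confidence multiplier is applied; `fuseScores` and `proceedsToTailoring` are illustrative names, not exports of score-fusion.ts.

```typescript
// Hypothetical sketch of the fusion step with confidence-adjusted LLM weight.
type Confidence = "low" | "medium" | "high";

const BASE_WEIGHTS = { semantic: 0.35, structured: 0.25, llm: 0.4 };
const NO_LLM_WEIGHTS = { semantic: 0.55, structured: 0.45 };
const CONFIDENCE_MULTIPLIER: Record<Confidence, number> = { low: 0.6, medium: 1.0, high: 1.2 };
const TAILORING_FLOOR = 70;

interface Fused {
  score: number;
  weights: { semantic: number; structured: number; llm: number };
}

function clamp(x: number): number {
  return Math.min(100, Math.max(0, x));
}

function fuseScores(
  semantic: number,
  structured: number,
  llm?: { score: number; confidence: Confidence },
): Fused {
  if (!llm) {
    // No-LLM fallback: fixed 0.55 / 0.45 split between Layers 1 and 2.
    const score = NO_LLM_WEIGHTS.semantic * semantic + NO_LLM_WEIGHTS.structured * structured;
    return { score: clamp(score), weights: { ...NO_LLM_WEIGHTS, llm: 0 } };
  }
  const llmRaw = BASE_WEIGHTS.llm * CONFIDENCE_MULTIPLIER[llm.confidence]; // 0.24 / 0.40 / 0.48
  const total = BASE_WEIGHTS.semantic + BASE_WEIGHTS.structured + llmRaw;
  const weights = {
    semantic: BASE_WEIGHTS.semantic / total,
    structured: BASE_WEIGHTS.structured / total,
    llm: llmRaw / total,
  };
  const score =
    weights.semantic * semantic + weights.structured * structured + weights.llm * llm.score;
  return { score: clamp(score), weights };
}

const proceedsToTailoring = (f: Fused): boolean => f.score >= TAILORING_FLOOR;
```

Under that assumption, a job with semantic 60, structured 55, and an LLM score of 95 fuses to 72.75 at medium confidence (passes the floor) but to about 68.5 at low confidence (rejected), because the LLM's share drops from 0.40 to about 0.29.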

Interface

```typescript
import { JScoreV2, type JScoreOptions } from '@jobs-apply/ai';

// openRouterProvider, job, profile, and jobs are assumed to be in scope.
const scorer = new JScoreV2(openRouterProvider, {
  weights: { semantic: 0.35, structured: 0.25, llm: 0.40 }, // fusion weights before confidence adjustment
  llmThreshold: 35, // Layer 3 runs only when the structured score >= 35
  skipSemantic: false, // true skips Layer 1
  skipLlm: false, // true skips Layer 3
});

// Single job
const result = await scorer.scoreJob(job, profile);
// result: { score, semantic, structured, llm, llmInvoked, reasoning, key_matches, concerns, weights, usage }

// Batch (Layer 1+2 for all, Layer 3 only for above-threshold)
const results = await scorer.batchScore(jobs, profile);
// results: Map<jobUrl, JScoreResult>
```

Audit logging is handled by ScoreAuditRepository.log(), which persists every score (including rejections) with the full layer breakdown, outcome tracking, and a 5-point bucket distribution for calibration analysis.
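A minimal sketch of a 5-point bucket distribution of the kind used for calibration analysis. Assumptions: buckets start at multiples of 5, out-of-range scores are clamped, and a perfect 100 is folded into the 95 bucket; `bucketOf` and `scoreDistribution` are hypothetical names, not the repository's getScoreDistribution().

```typescript
// Hypothetical sketch: count scores into 5-point buckets 0, 5, ..., 95.
function bucketOf(score: number): number {
  const clamped = Math.min(100, Math.max(0, score));
  return Math.min(95, Math.floor(clamped / 5) * 5); // 100 lands in the 95 bucket
}

function scoreDistribution(scores: number[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (let b = 0; b <= 95; b += 5) counts.set(b, 0); // every bucket present, even if empty
  for (const s of scores) {
    const b = bucketOf(s);
    counts.set(b, (counts.get(b) ?? 0) + 1);
  }
  return counts;
}
```

Keeping empty buckets in the map makes gaps visible when the distribution is plotted or compared across runs.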

Five-layer enforcement ensures no sub-70 job is ever submitted:

Config: server.ts sets minMatchScore: 70
Hard floor: match.ts exports MATCH_SCORE_FLOOR = 70 and clamps the constructor argument so it cannot go below 70
Auto-approve gate: channel-run-loop.ts checks score before auto-approving
Submit guard: submit.ts rejects sub-floor scores pre-submit
Junk filter: discover.ts regex patterns block non-job listings
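A sketch of how the hard floor, auto-approve gate, and submit guard might look, assuming a matcher that receives its threshold through the constructor. `FloorClampedMatcher` and `assertSubmittable` are hypothetical; the real checks live in match.ts, channel-run-loop.ts, and submit.ts.

```typescript
// Hypothetical sketch of three of the five guards.
const MATCH_SCORE_FLOOR = 70;

class FloorClampedMatcher {
  readonly minMatchScore: number;

  constructor(minMatchScore: number = MATCH_SCORE_FLOOR) {
    // Hard floor: a config value below 70 (or NaN) cannot lower the bar.
    this.minMatchScore = Number.isFinite(minMatchScore)
      ? Math.max(minMatchScore, MATCH_SCORE_FLOOR)
      : MATCH_SCORE_FLOOR;
  }

  // Auto-approve gate: approve only scores at or above the effective minimum.
  canAutoApprove(score: number): boolean {
    return score >= this.minMatchScore;
  }
}

// Submit guard: last check before submission.
function assertSubmittable(score: number): void {
  if (!(score >= MATCH_SCORE_FLOOR)) {
    throw new Error(`refusing to submit: score ${score} is below ${MATCH_SCORE_FLOOR}`);
  }
}
```

Writing the guard as `!(score >= floor)` rather than `score < floor` means a NaN score fails closed instead of slipping through.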

Provenance

Origin: 17-weakness audit (W1-W17) of the legacy single-LLM scorer from the archived autosearch system. Key failures: no semantic pre-filter caused false negatives on semantically-similar titles, non-deterministic LLM-only scoring produced inconsistent results across runs, zero feedback loop meant no learning from outcomes.

Historical validation: 3,742 legacy scores backfilled into score_audit from the archived system during the data import (2026-03-26). LLM invocation threshold of 35 validated against 2,874 of those historical jobs with zero false negatives.

Test coverage: 592 tests passing across the monorepo (vitest, 49 files). Backward compatible: legacy JobMatcher still works if setJScoreV2() is not called; computeHeuristicScore() still exported for tests.

Deployed: 2026-03-26. Integrated into the unified pipeline (DISCOVER -> MATCH -> TAILOR -> SUBMIT -> TRACK) where every cycle runs all 5 stages.

Usage Notes

Embedder warmup: Call warmupEmbedder() at server startup to avoid cold-start latency on the first job. The model is ~22MB and takes 1-2s to load.
Cost gate is the key design lever. The LLM threshold (default 35) controls the cost/accuracy tradeoff. Raising it saves money but risks false negatives on jobs where the LLM would have scored high despite low structured overlap. The current threshold was validated against 2,874 historical jobs with zero false negatives.
Confidence adjustment prevents LLM hallucination from dominating. When the LLM returns low confidence (sparse job description, unclear requirements), its 0.40 weight is scaled by 0.6 to 0.24 before renormalization, which leaves it about 0.29 of the total (0.24 / 0.84) once all three weights are renormalized to sum to 1. This is critical: without it, a hallucinated high score from a vague job posting could push a bad match over the 70 threshold.
Skill gaps flow end-to-end. structuredScore.skillGaps[] feeds into the LLM prompt (so it can reason about whether gaps are dealbreakers) and into the final concerns[] output (so the tailoring stage knows what to address in the cover letter).
Score audit enables closed-loop optimization. The score_audit table with outcome tracking (via updateOutcome()) and getScoreDistribution() provides the data for future score calibration: correlating J-Score predictions with actual interview/rejection outcomes from Gmail scanning.
Batch mode for discovery phases. batchScore() runs Layers 1+2 for all jobs first, then batches Layer 3 calls in chunks of 5 for above-threshold jobs. This is more efficient than scoring one job at a time during the DISCOVER phase.
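The two-phase batch flow from the batch-mode note can be sketched as follows. Layers are injected as functions (`cheapLayers` for Layers 1+2, `llmLayer` for Layer 3), which are hypothetical stand-ins for the internals of batchScore(); Phase 1 runs sequentially here for simplicity.

```typescript
// Hypothetical sketch: Layers 1+2 for all jobs, then Layer 3 in chunks of 5.
interface CheapScores {
  semantic: number;
  structured: number;
}

interface BatchEntry {
  cheap: CheapScores;
  llm: number | null;
}

async function batchScoreSketch<J extends { url: string }>(
  jobs: J[],
  cheapLayers: (job: J) => Promise<CheapScores>,
  llmLayer: (job: J, cheap: CheapScores) => Promise<number | null>,
  threshold = 35,
  chunkSize = 5,
): Promise<Map<string, BatchEntry>> {
  const results = new Map<string, BatchEntry>();

  // Phase 1: Layers 1+2 for every job before any LLM call is made.
  for (const job of jobs) {
    results.set(job.url, { cheap: await cheapLayers(job), llm: null });
  }

  // Phase 2: Layer 3 only for jobs at or above the threshold, chunkSize calls at a time.
  const eligible = jobs.filter((job) => results.get(job.url)!.cheap.structured >= threshold);
  for (let i = 0; i < eligible.length; i += chunkSize) {
    const chunk = eligible.slice(i, i + chunkSize);
    const scores = await Promise.all(chunk.map((job) => llmLayer(job, results.get(job.url)!.cheap)));
    chunk.forEach((job, k) => {
      results.get(job.url)!.llm = scores[k];
    });
  }
  return results;
}
```

Awaiting each chunk before starting the next caps concurrent LLM requests at `chunkSize`, which keeps the batch friendly to provider rate limits.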

Key files: packages/ai/src/scoring/score-fusion.ts (orchestrator + fusion), semantic-embedder.ts (Layer 1), structured-scorer.ts (Layer 2), enhanced-llm-scorer.ts (Layer 3), skills-taxonomy.ts (200+ skills), packages/database/src/repositories/score-audit.ts (audit trail), packages/engine/src/pipeline/match.ts (pipeline integration + floor enforcement).

Quality Checks

All 3 layers fire at or above the LLM threshold. Every score_audit row with structured >= 35 has non-null semantic, structured, and llm values.
Composite score in [0, 100]. SELECT MIN(score), MAX(score) FROM score_audit WHERE created > today returns values in range.
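Every invocation writes a score_audit row, including rejections (score < 70). SELECT COUNT(*) on score_audit matches the upstream invocation count.
LLM confidence weighting applied. Low-confidence LLM results show a pre-renormalization LLM weight of ~0.24 (0.6 x 0.40) in the audit breakdown and high-confidence results ~0.48 (1.2 x 0.40); if the breakdown stores renormalized weights, expect ~0.29 and ~0.44 instead.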
5-layer floor enforcement. Config = 70, MATCH_SCORE_FLOOR = 70, auto-approve checks score, submit guard rejects, junk filter blocks non-jobs. grep -r 'minMatchScore\|MATCH_SCORE_FLOOR' packages/ shows all 5.
Batch mode layering. batchScore() runs L1+L2 for all jobs first, then L3 only for above-threshold; no mixed ordering.
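The row-level checks above can also be run offline against exported score_audit rows. This sketch assumes a row shape with `score`, `semantic`, `structured`, and `llm` columns; `AuditRow` and `checkAuditRows` are hypothetical and complement, rather than replace, the SQL and grep checks.

```typescript
// Hypothetical offline check over exported score_audit rows.
interface AuditRow {
  score: number;
  semantic: number | null;
  structured: number | null;
  llm: number | null;
}

function checkAuditRows(rows: AuditRow[], upstreamInvocations: number, llmThreshold = 35): string[] {
  const problems: string[] = [];
  // Every invocation (including rejections) should have written exactly one row.
  if (rows.length !== upstreamInvocations) {
    problems.push(`row count ${rows.length} != invocation count ${upstreamInvocations}`);
  }
  rows.forEach((row, i) => {
    // Composite must lie in [0, 100]; the negated form also flags NaN.
    if (!(row.score >= 0 && row.score <= 100)) {
      problems.push(`row ${i}: composite ${row.score} outside [0, 100]`);
    }
    // At or above the LLM threshold, all three layer scores should be present.
    const aboveGate = row.structured !== null && row.structured >= llmThreshold;
    if (aboveGate && (row.semantic === null || row.llm === null)) {
      problems.push(`row ${i}: structured >= ${llmThreshold} but a layer score is null`);
    }
  });
  return problems;
}
```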

Visual Enrichment

| Medium | Type | Description |
|---|---|---|
| R | `DST` histogram | Score distribution |
| R | `COR` scatter+trend | Semantic vs structured vs composite |
| Figma | Flowchart | 3-layer fusion: semantic -> structured -> LLM -> composite |

Self-Improvement Cross-Reference

Pattern 3 (Metric Ratchet): scoring weights were tuned via karpathy-ratchet iterations. For the master reference on all 6 self-improvement patterns, see skills/self-improving-agent-patterns.