Experiment: jobs-apply

Gemini Flash vision can analyze ATS page screenshots to extract form structure and job details without DOM parsing

February 25, 2026

ai, vision, ats, career

Result: confirmed

Key Findings

Vision approach worked for page understanding but was eventually replaced by DOM-based extraction for form filling. Vision remained useful for fallback analysis of non-standard ATS layouts. The OpenRouter provider with vision support became the standard AI integration pattern.

Changelog

Date	Summary
2026-04-07	Created during temporal gap audit
2026-02-26	Original experiment

Hypothesis

Gemini Flash vision can analyze screenshots of Applicant Tracking System (ATS) pages to extract form structure and job details, avoiding DOM parsing across diverse and frequently changing ATS implementations. The intuition: every ATS looks different in the DOM but reasonably consistent to the human eye. If vision could match human recognition for “this is the job title, this is the apply button, this is a required field,” then the engine could treat ATS variance as a perception problem rather than a parser problem.

Method

I built an OpenRouter provider with vision support, then wrote test-hunt.ts as the first end-to-end pipeline probe. The flow: discover jobs via LinkedIn search, navigate to the ATS page, take a full-page screenshot, feed the screenshot to Gemini Flash with a structured-output prompt asking for job title, company, location, required fields, and submit-button location, then AI-match the extracted details against a candidate profile before deciding to proceed.
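A minimal sketch of how the screenshot step might be packaged for a vision-capable model through OpenRouter's OpenAI-compatible chat format. The function name buildVisionRequest, the prompt wording, and the JSON key names are assumptions for illustration, not code from test-hunt.ts. The model ID is left as a parameter because the exact Gemini Flash identifier used is not stated, and the plain json_object response format is only a stand-in for the strict schema mentioned later.

```typescript
// Sketch only: buildVisionRequest, EXTRACTION_PROMPT, and the key names in the
// prompt are hypothetical, not taken from test-hunt.ts.

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface VisionRequest {
  model: string;
  messages: Array<{ role: "user"; content: ContentPart[] }>;
  response_format: { type: "json_object" };
}

const EXTRACTION_PROMPT =
  "Return JSON with jobTitle, company, location, requiredFields, and submitButton " +
  "for the job page in this screenshot. Use null for anything that is not visible.";

// Builds the request body for one page-level analysis call.
// screenshotPngBase64 is the full-page screenshot, already base64-encoded.
function buildVisionRequest(model: string, screenshotPngBase64: string): VisionRequest {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: EXTRACTION_PROMPT },
          {
            type: "image_url",
            // The image travels inline as a data URL.
            image_url: { url: `data:image/png;base64,${screenshotPngBase64}` },
          },
        ],
      },
    ],
    // Assumption: plain JSON mode; a schema-constrained format would be stricter.
    response_format: { type: "json_object" },
  };
}
```

The resulting object would be sent as the JSON body of a POST to OpenRouter's chat completions endpoint with a bearer token. Keeping the builder pure makes it easy to check without network access.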

The prompt was tuned across roughly 20 trial ATS pages spanning Workday, Greenhouse, Lever, iCIMS, and a handful of direct careers pages. Structured output (JSON mode with a strict schema) kept the model from hallucinating fields that did not exist.
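To illustrate what a strict schema buys, here is a hedged sketch of validating the model's reply on the client side. The PageAnalysis shape, its field names, and the idea that the submit-button location is an x/y coordinate pair are all assumptions; the original schema is not shown in this write-up.

```typescript
// Hypothetical shape: field names and the x/y button location are assumptions.
interface PageAnalysis {
  jobTitle: string;
  company: string;
  location: string | null;
  requiredFields: string[];
  submitButton: { x: number; y: number } | null;
}

const ALLOWED_KEYS = new Set([
  "jobTitle", "company", "location", "requiredFields", "submitButton",
]);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a typed PageAnalysis, or null if the reply is not valid JSON,
// has a wrongly typed or missing field, or carries a key outside the schema.
function parsePageAnalysis(raw: string): PageAnalysis | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;
  // Strictness: an unexpected key is treated as a possible invented field.
  if (Object.keys(data).some((key) => !ALLOWED_KEYS.has(key))) return null;

  const { jobTitle, company, location, requiredFields, submitButton } = data;
  if (typeof jobTitle !== "string" || typeof company !== "string") return null;

  const loc = location === null ? null : typeof location === "string" ? location : undefined;
  if (loc === undefined) return null;

  if (!Array.isArray(requiredFields)) return null;
  const fields: string[] = [];
  for (const field of requiredFields) {
    if (typeof field !== "string") return null;
    fields.push(field);
  }

  let button: PageAnalysis["submitButton"] = null;
  if (submitButton !== null) {
    if (!isRecord(submitButton)) return null;
    const { x, y } = submitButton;
    if (typeof x !== "number" || typeof y !== "number") return null;
    button = { x, y };
  }
  return { jobTitle, company, location: loc, requiredFields: fields, submitButton: button };
}
```

Rejecting unknown keys outright is a deliberately conservative choice in this sketch: a reply that adds a field the schema never asked for is dropped rather than partially trusted.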

Results

Vision worked for page understanding. The model reliably extracted job titles, locations, and required fields from screenshots. Match quality against the candidate profile was good enough to gate submissions with acceptable false-positive rates.

For interactive form filling, DOM-based extraction proved more reliable. The reason is cost and latency more than capability: vision round-trips are expensive per field, while DOM queries are cheap and deterministic. Once you know a form has ten fields, calling the vision model ten times to fill them is the wrong approach. Vision remained as a fallback for non-standard layouts that resist DOM parsing.
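The DOM-first, vision-as-fallback policy can be sketched as a small dispatcher. Everything here is hypothetical (extractFields, FieldExtractor, the FormField shape); the two extractors are injected so the policy can be read and tested without a browser or a model call.

```typescript
// Hypothetical names throughout: a sketch of the policy, not the engine's code.
interface FormField {
  name: string;
  required: boolean;
}

interface FieldExtraction {
  source: "dom" | "vision";
  fields: FormField[];
}

type FieldExtractor = () => Promise<FormField[]>;

// Tries the cheap, deterministic DOM pass first. The vision extractor runs
// only when the DOM pass throws or finds no fields, so it is called at most
// once per page rather than once per field.
async function extractFields(
  fromDom: FieldExtractor,
  fromVision: FieldExtractor,
): Promise<FieldExtraction> {
  let domFields: FormField[] = [];
  try {
    domFields = await fromDom();
  } catch {
    // A failed DOM pass is treated the same as an empty one.
  }
  if (domFields.length > 0) {
    return { source: "dom", fields: domFields };
  }
  return { source: "vision", fields: await fromVision() };
}
```

Treating "no fields found" as the fallback trigger is an assumption; a real adapter might instead key off a known list of non-standard ATS hosts or an iframe check.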

Findings

Vision is the right tool for page-level understanding (classify this page, find the apply button, extract the job title once). DOM parsing is the right tool for field-level interaction (fill ten fields across three pages). The division of labor that fell out of this experiment held through the next six weeks of iteration.

The OpenRouter provider with multi-modal support (text plus vision) became the standard AI integration pattern in jobs-apply. The test-hunt.ts pipeline structure (discover, navigate, analyze, match) became the engine’s core loop. Every later adapter followed the same four-stage shape.
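The four-stage shape can be written down as an interface that each adapter fills in. This is a sketch under assumptions: HuntStages, runHunt, and the data shapes are illustrative names, and the actual test-hunt.ts code may be organized differently.

```typescript
// Illustrative types only: none of these names come from the repository.
interface JobLead {
  url: string;
}

interface JobSummary {
  jobTitle: string;
  company: string;
}

interface MatchDecision {
  proceed: boolean;
  reason: string;
}

// One method per stage: discover, navigate, analyze, match.
// Page is generic so an adapter can use whatever page handle it likes.
interface HuntStages<Page> {
  discover(): Promise<JobLead[]>;
  navigate(lead: JobLead): Promise<Page>;
  analyze(page: Page): Promise<JobSummary>;
  match(summary: JobSummary): Promise<MatchDecision>;
}

interface HuntResult {
  lead: JobLead;
  summary: JobSummary;
  decision: MatchDecision;
}

// Runs the stages in order for every discovered lead, one lead at a time.
async function runHunt<Page>(stages: HuntStages<Page>): Promise<HuntResult[]> {
  const results: HuntResult[] = [];
  for (const lead of await stages.discover()) {
    const page = await stages.navigate(lead);
    const summary = await stages.analyze(page);
    const decision = await stages.match(summary);
    results.push({ lead, summary, decision });
  }
  return results;
}
```

An adapter for a given ATS would then differ only in how it implements navigate and analyze, while discovery and matching stay shared.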

Next Steps

Push on the DOM-extraction side first, because that is where volume comes from. Reserve vision for the hard cases: non-standard layouts, iframe-heavy ATS implementations, and visual confirmation of final-page state.

Source

test-hunt.ts pipeline probe in the jobs-apply repository, 2026-02-26.