Gemini Flash vision can analyze ATS page screenshots to extract form structure and job details without DOM parsing
The vision approach worked for page understanding, but DOM-based extraction eventually replaced it for form filling. Vision remained useful as a fallback for analyzing non-standard ATS layouts. The OpenRouter provider with vision support became the standard AI integration pattern.

Changelog
| Date | Summary |
|---|---|
| 2026-04-07 | Created during temporal gap audit |
| 2026-02-26 | Original experiment |
Hypothesis
Gemini Flash vision can analyze screenshots of Applicant Tracking System (ATS) pages to extract form structure and job details, bypassing the need to parse the DOM of diverse and frequently changing ATS implementations. The intuition: every ATS looks different in the DOM but looks reasonably consistent to a human eye. If vision could match human recognition for “this is the job title, this is the apply button, this is a required field,” then the engine could treat ATS variance as a perception problem rather than a parser problem.
Method
I built an OpenRouter provider with vision support, then wrote test-hunt.ts as the first end-to-end pipeline probe. The flow: discover jobs via LinkedIn search, navigate to the ATS page, take a full-page screenshot, feed the screenshot to Gemini Flash with a structured-output prompt asking for job title, company, location, required fields, and submit-button location, then AI-match the extracted details against a candidate profile before deciding to proceed.
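The flow above can be sketched as four injected stages. This is a minimal sketch, not the contents of test-hunt.ts: every type and function name below (HuntStages, runHunt, and so on) is hypothetical, and the stages are left abstract so that no particular browser or model API is assumed.

```typescript
// Hypothetical types: none of these names are taken from test-hunt.ts.
interface JobLead {
  url: string;
}

interface PageAnalysis {
  jobTitle: string;
  company: string;
  location: string;
  requiredFields: string[];
  submitButton: string | null; // rough description of where the submit control is
}

interface MatchResult {
  score: number;
  proceed: boolean;
}

// Each stage is injected, so the loop itself stays independent of LinkedIn,
// the browser driver, and the model provider.
interface HuntStages {
  discover(query: string): Promise<JobLead[]>;
  navigate(lead: JobLead): Promise<Uint8Array>; // resolves to full-page screenshot bytes
  analyze(screenshot: Uint8Array): Promise<PageAnalysis>;
  match(analysis: PageAnalysis): Promise<MatchResult>;
}

interface HuntOutcome {
  lead: JobLead;
  analysis: PageAnalysis;
  match: MatchResult;
}

// Runs discover -> navigate -> analyze -> match for every discovered lead.
async function runHunt(stages: HuntStages, query: string): Promise<HuntOutcome[]> {
  const outcomes: HuntOutcome[] = [];
  for (const lead of await stages.discover(query)) {
    const screenshot = await stages.navigate(lead);
    const analysis = await stages.analyze(screenshot);
    const match = await stages.match(analysis);
    outcomes.push({ lead, analysis, match });
  }
  return outcomes;
}
```

A caller would inspect match.proceed on each outcome before going on to fill the form. Keeping the stages behind an interface also makes the loop testable with stubs, without a browser or a model call.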
The prompt was tuned across roughly 20 trial ATS pages spanning Workday, Greenhouse, Lever, iCIMS, and a handful of direct careers pages. Structured output (JSON mode with a strict schema) kept the model from hallucinating fields that did not exist.
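To illustrate what a strict schema buys on the receiving side, here is a minimal, dependency-free validator for the model's JSON reply. It is a sketch under assumptions: the field names mirror the items listed in the prompt description, but the actual schema used in the experiment is not shown in this log, so AtsPageExtraction and parseExtraction are hypothetical.

```typescript
// Hypothetical schema: field names are illustrative, not the project's actual contract.
interface AtsPageExtraction {
  jobTitle: string;
  company: string;
  location: string;
  requiredFields: string[];
  submitButton: string | null;
}

const ALLOWED_KEYS = ["jobTitle", "company", "location", "requiredFields", "submitButton"];

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parses the raw model reply and throws on anything outside the schema.
function parseExtraction(raw: string): AtsPageExtraction {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("expected a JSON object");
  }
  const obj = data as Record<string, unknown>;

  // Strictness: reject any key the schema does not declare.
  for (const key of Object.keys(obj)) {
    if (ALLOWED_KEYS.indexOf(key) === -1) {
      throw new Error("unexpected key: " + key);
    }
  }

  const jobTitle = obj.jobTitle;
  const company = obj.company;
  const location = obj.location;
  const requiredFields = obj.requiredFields;
  const submitButton = obj.submitButton;

  // Missing keys arrive as undefined and fail these type checks.
  if (typeof jobTitle !== "string") throw new Error("jobTitle must be a string");
  if (typeof company !== "string") throw new Error("company must be a string");
  if (typeof location !== "string") throw new Error("location must be a string");
  if (!isStringArray(requiredFields)) throw new Error("requiredFields must be string[]");
  if (submitButton !== null && typeof submitButton !== "string") {
    throw new Error("submitButton must be a string or null");
  }

  return {
    jobTitle,
    company,
    location,
    requiredFields,
    submitButton: submitButton as string | null,
  };
}
```

Rejecting undeclared keys is the part that guards against invented fields: a reply carrying an extra property fails loudly instead of flowing into the matcher.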
Results
Vision worked for page understanding. The model reliably extracted job titles, locations, and required fields from screenshots. Match quality against the candidate profile was good enough to gate submissions with acceptable false-positive rates.
For interactive form filling, DOM-based extraction proved more reliable. The reason is cost and latency more than capability: vision round-trips are expensive per field, while DOM queries are cheap and deterministic. Once you know a form has ten fields, calling the vision model ten times to fill them is the wrong tool. Vision remained as a fallback for non-standard layouts that resist DOM parsing.
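The cost argument is simple arithmetic, sketched below. The per-call latency is a caller-supplied placeholder, since this log reports no measured figures, and the function names are hypothetical. The sketch also assumes that DOM queries need no model call and that their local cost is negligible by comparison.

```typescript
// Sketch only: latencyPerCallMs is a caller-supplied assumption, not a measured figure.
interface CallProfile {
  modelCalls: number;
  estLatencyMs: number;
}

// Vision-only filling: one model round-trip per form field.
function visionPerField(fieldCount: number, latencyPerCallMs: number): CallProfile {
  return { modelCalls: fieldCount, estLatencyMs: fieldCount * latencyPerCallMs };
}

// Hybrid: one vision call per page for understanding; fields are filled via
// DOM queries, which need no model call (their local cost is ignored here).
function visionPerPage(pageCount: number, latencyPerCallMs: number): CallProfile {
  return { modelCalls: pageCount, estLatencyMs: pageCount * latencyPerCallMs };
}

// Ten fields spread over three pages, with a placeholder 2000 ms per call.
const perField = visionPerField(10, 2000); // { modelCalls: 10, estLatencyMs: 20000 }
const perPage = visionPerPage(3, 2000);    // { modelCalls: 3, estLatencyMs: 6000 }
```

Whatever the real per-call figure is, the number of model calls scales with fields in the first case and with pages in the second, which is why per-field vision loses as forms grow.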
Findings
Vision is the right tool for page-level understanding (classify this page, find the apply button, extract the job title once). DOM parsing is the right tool for field-level interaction (fill ten fields across three pages). The division of labor that fell out of this experiment held through the next six weeks of iteration.
The OpenRouter provider with multi-modal support (text plus vision) became the standard AI integration pattern in jobs-apply. The test-hunt.ts pipeline structure (discover, navigate, analyze, match) became the engine’s core loop. Every later adapter followed the same four-stage shape.
Next Steps
Push on the DOM-extraction side first, because that is where volume comes from. Reserve vision for the hard cases: non-standard layouts, iframe-heavy ATS implementations, and visual confirmation of final-page state.
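One way to express "DOM first, vision for the hard cases" is a small routing predicate. This is a hedged sketch: the PageSignals shape, the chooseTool name, and the iframe threshold are all hypothetical illustrations and are not taken from the jobs-apply code.

```typescript
// Hypothetical signals a DOM pass might report about the current page.
interface PageSignals {
  domFieldsFound: number; // form fields the DOM extractor managed to resolve
  iframeCount: number;    // number of iframes on the page
  isFinalPage: boolean;   // true when confirming the final-page state
}

type Tool = "dom" | "vision";

// Default to the DOM; route to vision only for the hard cases named above.
// maxIframes is an illustrative threshold, not a tuned value.
function chooseTool(signals: PageSignals, maxIframes: number = 2): Tool {
  if (signals.isFinalPage) return "vision";              // visual confirmation of final state
  if (signals.iframeCount > maxIframes) return "vision"; // iframe-heavy implementation
  if (signals.domFieldsFound === 0) return "vision";     // layout resists DOM parsing
  return "dom";
}
```

For example, chooseTool({ domFieldsFound: 10, iframeCount: 0, isFinalPage: false }) returns "dom", while the same page with isFinalPage set to true returns "vision".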
Source
test-hunt.ts pipeline probe in the jobs-apply repository, 2026-02-26.