A three-stage retrieval pipeline (BM25 keyword filter -> embedding rerank -> LLM context window) will return relevant vault notes within 2 seconds for 609+ files, making real-time vault search from Claude Code practical

## Changelog
| Date | Summary |
|---|---|
| 2026-04-06 | Audited: chain updated (iteration 5), domain tag ai-agents, last_audited stamped |
| 2026-04-04 | Initial creation |
## Hypothesis
The staged retrieval research documents Jeff Dean’s architecture for narrowing trillion-token corpora to the “right million.” The vault has 609+ files (~500K tokens total). A three-stage pipeline can narrow this to the 5-10 most relevant notes within 2 seconds, solving the problem described in vault search from Claude Code: sessions currently have no access to the 600+ notes in the vault.
The nested CLI and SessionEnd timeout pitfalls constrain the implementation: any hook that shells out to `claude -p` must unset `CLAUDECODE` first, and SessionEnd hooks must complete within the configured timeout.
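As a sketch of the first constraint, a hook can strip the flag from the child process environment before shelling out. This is a minimal illustration, not the hook's actual code; the function names and the 30-second default are assumptions, and only the `CLAUDECODE` handling comes from the note.

```python
import os
import subprocess

def child_env() -> dict:
    """Copy of the current environment with CLAUDECODE removed, so a
    child `claude -p` process does not detect a nested CLI session."""
    return {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}

def query_claude(prompt: str, timeout: float = 30.0) -> str:
    """Shell out to `claude -p` from inside a hook. The timeout keeps
    the call inside the hook's budget (the value here is illustrative)."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        env=child_env(),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Passing a filtered copy of the environment leaves the parent session's own `CLAUDECODE` untouched, which matters if the hook runs more than once.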
## Method
- Stage 1, BM25 keyword filter: build an inverted index over all vault frontmatter fields (title, tags, domain, project) plus the first 200 words of each body. Use a lightweight BM25 implementation (lunr.js or equivalent). This narrows 609 files to ~50 candidates in <100ms.
- Stage 2, embedding rerank: compute embeddings for the ~50 BM25 candidates with a local model (all-MiniLM-L6-v2, 22M parameters, runs on CPU) and rerank by cosine similarity to the query embedding, keeping the top 10. Target: <500ms.
- Stage 3, context injection: format the top 10 results as a structured context block (title, frontmatter summary, first paragraph) and inject it into the Claude Code session via the peon-notify hook. Target: <200ms.
- Total latency budget: <2 seconds end-to-end (100ms + 500ms + 200ms + overhead).
- Relevance evaluation: test 20 queries covering each vault dimension and compare the pipeline's results against a manual expert ranking, scored with nDCG@10.
- Integration point: UserPromptSubmit hook in peon-notify. When user prompt contains vault-relevant keywords (detected by a lightweight classifier), trigger the retrieval pipeline and inject results as context.
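The three stages above can be sketched end to end in pure Python. This is a minimal illustration under stated assumptions, not the production implementation: real code would use lunr.js or a tuned BM25 library, and the `embed` callable here stands in for the all-MiniLM-L6-v2 model.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1=1.5, b=0.75) -> list[float]:
    """Stage 1: score every doc against the query with BM25."""
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter()
    for t in toks:
        df.update(set(t))           # document frequency per term
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(s)
    return scores

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, embed, stage1_k=50, stage2_k=10) -> str:
    """BM25 filter -> embedding rerank -> formatted context block."""
    s = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: -s[i])[:stage1_k]
    qv = embed(query)
    reranked = sorted(candidates, key=lambda i: -cosine(qv, embed(docs[i])))
    return "\n".join(f"- {docs[i][:80]}" for i in reranked[:stage2_k])
```

A real Stage 3 block would carry the title, frontmatter summary, and first paragraph per note rather than a truncated body, but the filter-then-rerank shape is the same.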
## Results
Pending. Will measure:
- End-to-end latency (p50, p95, p99)
- Relevance score (nDCG@10 against expert rankings)
- Stage 1 recall (what percentage of relevant notes survive BM25 filtering)
- CPU/memory overhead of the embedding model
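The nDCG@10 metric named above is straightforward to compute once each returned note has an expert-assigned relevance grade. A minimal sketch (grades and function names are illustrative):

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain for a ranked list of relevance grades:
    each grade is discounted by log2 of its 1-based rank plus one."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system's ranking divided by the DCG of the
    ideal (descending) ordering of the same grades."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0
```

Averaging `ndcg_at_k` over the 20 test queries gives the single relevance score to report.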
## Findings
Pending.
## Next Steps
If latency and relevance targets are met, deploy as a production hook. Consider building a persistent index that updates incrementally (content-hash-dedup pattern) rather than rebuilding on every query. Long-term: this could power the MCP-based vault search tool described in the source idea.
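The content-hash-dedup pattern mentioned above could look roughly like this: keep a state file mapping each note path to a digest of its content, and on each run re-index only the notes whose digest changed. The state-file layout and function name are assumptions, not the pattern's canonical form.

```python
import hashlib
import json
from pathlib import Path

def update_index(vault_dir: str, state_path: str) -> list[str]:
    """Return the vault notes whose content hash changed since the last
    run, so only those need re-indexing and re-embedding. State is a
    JSON file mapping note path -> sha256 of its bytes (assumed layout)."""
    state_file = Path(state_path)
    old = json.loads(state_file.read_text()) if state_file.exists() else {}
    new, changed = {}, []
    for note in sorted(Path(vault_dir).rglob("*.md")):
        digest = hashlib.sha256(note.read_bytes()).hexdigest()
        new[str(note)] = digest
        if old.get(str(note)) != digest:
            changed.append(str(note))
    state_file.write_text(json.dumps(new, indent=2))
    return changed
```

At ~609 files the full rebuild is cheap, but the incremental path keeps the per-query cost flat as the vault grows, and only changed notes pay the embedding cost.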