A three-stage retrieval pipeline (BM25 keyword filter -> embedding rerank -> LLM context window) will return relevant vault notes within 2 seconds for 609+ files, making real-time vault search from Claude Code practical

## Changelog
| Date | Summary |
|---|---|
| 2026-04-06 | Audited: chain updated (iteration 5), domain tag ai-agents, last_audited stamped |
| 2026-04-04 | Initial creation |
## Hypothesis
The staged retrieval research documents Jeff Dean’s architecture for narrowing trillion-token corpora to the “right million.” The vault has 609+ files (~500K tokens total). A three-stage pipeline can narrow this to the 5-10 most relevant notes within 2 seconds, solving the problem described in vault search from Claude Code: sessions currently have no access to the 600+ notes in the vault.
The nested CLI and SessionEnd timeout pitfalls constrain the implementation: any hook that shells out to `claude -p` must unset `CLAUDECODE` first, and SessionEnd hooks must complete within the configured timeout.
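As a sketch of the first constraint, a hook can strip the flag from the child process environment before shelling out. This is a minimal illustration, not the hook's actual code; the function names and the 30-second default are assumptions, and only the `CLAUDECODE` handling comes from the note.

```python
import os
import subprocess

def child_env() -> dict:
    """Copy of the current environment with CLAUDECODE removed, so a
    child `claude -p` process does not detect a nested CLI session."""
    return {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}

def query_claude(prompt: str, timeout: float = 30.0) -> str:
    """Shell out to `claude -p` from inside a hook. The timeout keeps
    the call inside the hook's budget (the value here is illustrative)."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        env=child_env(),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Passing a filtered copy of the environment leaves the parent session's own `CLAUDECODE` untouched, which matters if the hook runs more than once.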
## Method
- Stage 1, BM25 keyword filter: build an inverted index over all vault frontmatter fields (title, tags, domain, project) plus the first 200 words of each body. Use a lightweight BM25 implementation (lunr.js or equivalent). This narrows 609 files to ~50 candidates in <100ms.
- Stage 2, embedding rerank: compute embeddings for the ~50 BM25 candidates with a local model (all-MiniLM-L6-v2, 22M parameters, runs on CPU) and rerank by cosine similarity to the query embedding, keeping the top 10. Target: <500ms.
- Stage 3, context injection: format the top 10 results as a structured context block (title, frontmatter summary, first paragraph) and inject it into the Claude Code session via the peon-notify hook. Target: <200ms.
- Total latency budget: <2 seconds end-to-end (100ms + 500ms + 200ms + overhead).
- Relevance evaluation: test 20 queries covering each vault dimension and compare the pipeline's results against a manual expert ranking, scored with nDCG@10.
- Integration point: UserPromptSubmit hook in peon-notify. When user prompt contains vault-relevant keywords (detected by a lightweight classifier), trigger the retrieval pipeline and inject results as context.
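The three stages above can be sketched end to end in pure Python. This is a minimal illustration under stated assumptions, not the production implementation: real code would use lunr.js or a tuned BM25 library, and the `embed` callable here stands in for the all-MiniLM-L6-v2 model.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1=1.5, b=0.75) -> list[float]:
    """Stage 1: score every doc against the query with BM25."""
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter()
    for t in toks:
        df.update(set(t))           # document frequency per term
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(s)
    return scores

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, embed, stage1_k=50, stage2_k=10) -> str:
    """BM25 filter -> embedding rerank -> formatted context block."""
    s = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: -s[i])[:stage1_k]
    qv = embed(query)
    reranked = sorted(candidates, key=lambda i: -cosine(qv, embed(docs[i])))
    return "\n".join(f"- {docs[i][:80]}" for i in reranked[:stage2_k])
```

A real Stage 3 block would carry the title, frontmatter summary, and first paragraph per note rather than a truncated body, but the filter-then-rerank shape is the same.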
## Results
Pending. Will measure:
- End-to-end latency (p50, p95, p99)
- Relevance score (nDCG@10 against expert rankings)
- Stage 1 recall (what percentage of relevant notes survive BM25 filtering)
- CPU/memory overhead of the embedding model
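The nDCG@10 metric named above is straightforward to compute once each returned note has an expert-assigned relevance grade. A minimal sketch (grades and function names are illustrative):

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain for a ranked list of relevance grades:
    each grade is discounted by log2 of its 1-based rank plus one."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system's ranking divided by the DCG of the
    ideal (descending) ordering of the same grades."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0
```

Averaging `ndcg_at_k` over the 20 test queries gives the single relevance score to report.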
## Findings
Pending.
## Next Steps
If latency and relevance targets are met, deploy as a production hook. Consider building a persistent index that updates incrementally (content-hash-dedup pattern) rather than rebuilding on every query. Long-term: this could power the MCP-based vault search tool described in the source idea.
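The content-hash-dedup pattern mentioned above could look roughly like this: keep a state file mapping each note path to a digest of its content, and on each run re-index only the notes whose digest changed. The state-file layout and function name are assumptions, not the pattern's canonical form.

```python
import hashlib
import json
from pathlib import Path

def update_index(vault_dir: str, state_path: str) -> list[str]:
    """Return the vault notes whose content hash changed since the last
    run, so only those need re-indexing and re-embedding. State is a
    JSON file mapping note path -> sha256 of its bytes (assumed layout)."""
    state_file = Path(state_path)
    old = json.loads(state_file.read_text()) if state_file.exists() else {}
    new, changed = {}, []
    for note in sorted(Path(vault_dir).rglob("*.md")):
        digest = hashlib.sha256(note.read_bytes()).hexdigest()
        new[str(note)] = digest
        if old.get(str(note)) != digest:
            changed.append(str(note))
    state_file.write_text(json.dumps(new, indent=2))
    return changed
```

At ~609 files the full rebuild is cheap, but the incremental path keeps the per-query cost flat as the vault grows, and only changed notes pay the embedding cost.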