Why I built a semantic grep
tl;dr
While auditing large codebases, I often didn't know what string to grep for, so I built a semantic grep: it indexes git-tracked files into line-based chunks, embeds them via an OpenAI-compatible API, searches them with LanceDB, and exposes results through grep-style output and MCP resources instead of raw file dumps.
Semantic-chan: grep grew a brain (and learned MCP)
When I audit unfamiliar codebases, grep only gets me so far. It is great when I already know the identifiers I need. It is much less helpful when the real questions are conceptual:
- Where is auth actually implemented?
- Where is input validated?
- Where do we rate limit?
That gap is why I built semantic-chan: a semantic grep that stays local, keeps outputs grep-shaped, and is safe enough to expose to agents via MCP.
The pipeline
The workflow is intentionally boring:
- Use `git ls-files` so indexing matches `.gitignore` reality.
- Chunk files into overlapping line-based segments.
- Embed each chunk using an OpenAI-compatible `/v1/embeddings` endpoint.
- Store vectors + metadata locally in LanceDB.
- Embed the query, run vector search, then format results as grep/JSON/summary or MCP resources.
No ASTs. No LSP integration. Just vectors, line ranges, and predictable behavior.
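For concreteness, here is a minimal sketch of the indexing side in Python. The LanceDB calls are real, but `EMBED_URL`, the model name, the chunk sizes, and the `.semchan` directory are placeholders, not semantic-chan's actual values:

```python
import subprocess

import lancedb
import requests

EMBED_URL = "http://localhost:8080/v1/embeddings"  # placeholder endpoint
EMBED_MODEL = "embed-model"                        # placeholder model name
CHUNK_LINES, OVERLAP = 40, 10                      # illustrative sizes

def git_files() -> list[str]:
    # git ls-files keeps the index aligned with .gitignore reality.
    out = subprocess.run(["git", "ls-files"], capture_output=True,
                         text=True, check=True)
    return out.stdout.splitlines()

def chunk(path: str) -> list[dict]:
    # Overlapping line-based segments; no ASTs, just line ranges.
    lines = open(path, encoding="utf-8", errors="replace").read().splitlines()
    step = CHUNK_LINES - OVERLAP
    return [
        {"path": path, "start": i + 1,
         "end": min(i + CHUNK_LINES, len(lines)),
         "text": "\n".join(lines[i:i + CHUNK_LINES])}
        for i in range(0, len(lines), step)
    ]

def embed(texts: list[str]) -> list[list[float]]:
    # Any OpenAI-compatible /v1/embeddings endpoint works here.
    r = requests.post(EMBED_URL, json={"model": EMBED_MODEL, "input": texts})
    r.raise_for_status()
    return [d["embedding"] for d in r.json()["data"]]

def index(db_path: str = ".semchan") -> None:
    chunks = [c for f in git_files() for c in chunk(f)]
    for c, vec in zip(chunks, embed([c["text"] for c in chunks])):
        c["vector"] = vec
    # Vectors + metadata live locally in LanceDB.
    lancedb.connect(db_path).create_table("chunks", data=chunks,
                                          mode="overwrite")
```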
What makes it usable in practice
Indexing and querying have a few “small but important” details:
- Noise control: skip heavy directories (`node_modules`, `vendor`, `dist`, `build`) and only index code-like extensions by default.
- Incremental updates: a `meta.json` tracks SHA1 hashes so unchanged files are skipped.
- Filters: glob / case-insensitive glob, file type filters, and optional path prefixes.
- Oversampling: if you filter, it pulls more candidates first so you do not accidentally filter everything out.
- Threshold relaxation: if nothing matches, it relaxes the distance cutoff once (within bounds) and tells you. Both this and oversampling are sketched after this list.
- Snippet merging + context: adjacent chunks merge into one snippet; optional context lines come from disk, grep-style.
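Here is roughly how the oversampling and relaxation steps compose. This is a sketch, not the real implementation: the factor and cutoffs are invented, and `embed()` comes from the indexing sketch above.

```python
import fnmatch

import lancedb

OVERSAMPLE = 5   # assumed factor: fetch extra candidates before filtering
CUTOFF = 0.45    # assumed default distance cutoff
RELAXED = 0.60   # assumed relaxed bound, applied at most once

def search(query: str, limit: int = 10, glob: str | None = None,
           db_path: str = ".semchan") -> list[dict]:
    table = lancedb.connect(db_path).open_table("chunks")
    qvec = embed([query])[0]  # embed() from the indexing sketch above
    # Oversample when a filter is active so filtering rarely empties the set.
    fetch = limit * OVERSAMPLE if glob else limit
    rows = table.search(qvec).limit(fetch).to_list()
    if glob:
        rows = [r for r in rows if fnmatch.fnmatch(r["path"], glob)]
    hits = [r for r in rows if r["_distance"] <= CUTOFF]
    if not hits:
        # Relax the cutoff once, within a fixed bound, and say so.
        hits = [r for r in rows if r["_distance"] <= RELAXED]
        if hits:
            print(f"note: relaxed distance cutoff {CUTOFF} -> {RELAXED}")
    return hits[:limit]
```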
MCP integration (and why it matters)
The MCP server is the real reason semantic-chan exists.
It exposes:
- Tools: `semchan_search`, `semchan_read_file`, `semchan_index`, `semchan_index_status`
- Resources: repo metadata, index metadata, file resources, snippet resources with bounded reads
- Prompts: semantic grep helper, symbol tracing helper
Crucially, searches return resource links, not raw file dumps. Agents must explicitly fetch snippets or files, and those reads are bounded.
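In code, that shape looks roughly like the sketch below, using the official MCP Python SDK's `FastMCP` helper. The `semchan://` URI scheme is invented for the example, and `run_vector_search` / `load_snippet` are hypothetical stand-ins for the real internals:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("semantic-chan")

@mcp.tool()
def semchan_search(query: str, limit: int = 10) -> list[dict]:
    """Return resource links to matching snippets, never file contents."""
    hits = run_vector_search(query, limit)  # hypothetical search helper
    return [
        {"uri": f"semchan://snippet/{h['id']}",  # invented URI scheme
         "path": h["path"],
         "lines": f"{h['start']}-{h['end']}",
         "distance": h["distance"]}
        for h in hits
    ]

@mcp.resource("semchan://snippet/{snippet_id}")
def snippet(snippet_id: str) -> str:
    # The agent must dereference a link explicitly, and the read is bounded.
    return load_snippet(snippet_id, max_bytes=64_000)  # hypothetical helper

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

Returning links instead of content keeps the agent's context window clean and makes every read an explicit, bounded operation.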
Safety choices
- Path traversal is blocked.
- MCP file reads are repo-relative only.
- Non-code files are disallowed by default.
- File reads are byte-limited.
- HTTP MCP transport requires a bearer token and origin checks.
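The traversal and size checks are small enough to show in full. A minimal version, assuming a resolved repo root and an invented 64 KiB cap:

```python
from pathlib import Path

REPO_ROOT = Path(".").resolve()  # reads are repo-relative only
MAX_READ_BYTES = 64 * 1024       # assumed cap; the real limit may differ

def safe_read(rel_path: str) -> bytes:
    target = (REPO_ROOT / rel_path).resolve()
    # Block traversal: the resolved path must stay inside the repo root.
    if not target.is_relative_to(REPO_ROOT):
        raise PermissionError(f"path escapes repo root: {rel_path}")
    with open(target, "rb") as f:
        return f.read(MAX_READ_BYTES)  # byte-limited read
```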
Limitations (for now)
- Deleted files are not yet pruned from the index.
- A fresh process can leave stale chunks due to a table-opening edge case.
- Meta file placement differs slightly between CLI and MCP modes.
- Chunking is intentionally naive.
Takeaway
Semantic-chan is not trying to replace grep. It answers a different class of “where is this idea implemented?” questions, which makes it genuinely useful for audits and exploratory work.