
Why I built a semantic grep

tl;dr summary

While auditing large codebases where I didn’t know what to search for, I built a semantic grep that indexes git-tracked files into line-based chunks, embeds them via an OpenAI-compatible API, searches them with LanceDB, and exposes results through grep-style output and MCP resources instead of raw file dumps.

Semantic-chan: grep grew a brain (and learned MCP)

When I audit unfamiliar codebases, grep only gets me so far. It is great when I already know the identifiers I need. It is much less helpful when the real questions are conceptual:

  • Where is auth actually implemented?
  • Where is input validated?
  • Where do we rate limit?

That gap is why I built semantic-chan: a semantic grep that stays local, keeps outputs grep-shaped, and is safe enough to expose to agents via MCP.


The pipeline

The workflow is intentionally boring:

  1. Use git ls-files so indexing matches .gitignore reality.
  2. Chunk files into overlapping line-based segments.
  3. Embed each chunk using an OpenAI-compatible /v1/embeddings endpoint.
  4. Store vectors + metadata locally in LanceDB.
  5. Embed the query, run vector search, then format results as grep/JSON/summary or MCP resources.

No ASTs. No LSP integration. Just vectors, line ranges, and predictable behavior.
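
To make those steps concrete, here is a minimal Python sketch of the indexing side, assuming a local LanceDB directory and any OpenAI-compatible embeddings server. The chunk size, overlap, endpoint URL, and model name are illustrative placeholders rather than semantic-chan's actual defaults, and batching plus extension filtering are omitted for brevity.

    import subprocess
    import lancedb
    import requests

    EMBED_URL = "http://localhost:8080/v1/embeddings"  # any OpenAI-compatible server
    EMBED_MODEL = "embedding-model"                    # placeholder model name
    CHUNK_LINES, OVERLAP = 40, 10                      # line-based chunks with overlap

    def tracked_files():
        # git ls-files keeps the index aligned with .gitignore reality
        out = subprocess.run(["git", "ls-files"], capture_output=True, text=True, check=True)
        return out.stdout.splitlines()

    def chunk_file(path):
        lines = open(path, encoding="utf-8", errors="replace").read().splitlines()
        if not lines:
            return
        step = CHUNK_LINES - OVERLAP
        for start in range(0, len(lines), step):
            end = min(start + CHUNK_LINES, len(lines))
            yield {"file": path, "start": start + 1, "end": end,
                   "text": "\n".join(lines[start:end])}
            if end == len(lines):
                break

    def embed(texts):
        # one request per call; a real indexer would batch and retry
        resp = requests.post(EMBED_URL, json={"model": EMBED_MODEL, "input": texts})
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    def build_index(db_path=".semchan"):
        records = [c for path in tracked_files() for c in chunk_file(path)]
        for record, vector in zip(records, embed([r["text"] for r in records])):
            record["vector"] = vector
        db = lancedb.connect(db_path)
        db.create_table("chunks", data=records, mode="overwrite")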


What makes it usable in practice

Indexing and querying have a few “small but important” details:

  • Noise control: skip heavy directories (node_modules, vendor, dist, build) and only index code-like extensions by default.
  • Incremental updates: a meta.json tracks SHA1 hashes so unchanged files are skipped.
  • Filters: glob / case-insensitive glob, file type filters, and optional path prefixes.
  • Oversampling: if you filter, it pulls more candidates first so you do not accidentally filter everything out (see the query sketch after this list).
  • Threshold relaxation: if nothing matches, it relaxes the distance cutoff once (within bounds) and tells you.
  • Snippet merging + context: adjacent chunks merge into one snippet; optional context lines come from disk, grep-style.
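
Here is a rough sketch of how the query path can combine those heuristics, reusing the embed() helper and lancedb import from the indexing sketch above. The oversampling factor and distance cutoff are invented defaults for illustration, not semantic-chan's real numbers, and snippet merging is left out.

    def search(query, db_path=".semchan", limit=10, file_glob=None, max_dist=0.6):
        table = lancedb.connect(db_path).open_table("chunks")
        query_vec = embed([query])[0]

        # Oversample when a filter is active so post-filtering still leaves results.
        fetch = limit * 5 if file_glob else limit
        q = table.search(query_vec).metric("cosine").limit(fetch)
        if file_glob:
            # crude glob -> SQL LIKE translation, good enough for a sketch
            q = q.where("file LIKE '{}'".format(file_glob.replace("*", "%")))
        hits = q.to_list()

        matches = [h for h in hits if h["_distance"] <= max_dist]
        if not matches and hits:
            # Relax the cutoff once, within a bound, and say so.
            relaxed = min(max_dist * 1.5, 1.0)
            matches = [h for h in hits if h["_distance"] <= relaxed]
            print(f"# no hits under {max_dist}; relaxed cutoff to {relaxed:.2f}")

        for hit in matches[:limit]:
            # grep-style: path:start-end: first line of the matching chunk
            first_line = hit["text"].splitlines()[0] if hit["text"] else ""
            print(f'{hit["file"]}:{hit["start"]}-{hit["end"]}: {first_line}')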

MCP integration (and why it matters)

The MCP server is the real reason semantic-chan exists.

It exposes:

  • Tools: semchan_search, semchan_read_file, semchan_index, semchan_index_status
  • Resources: repo metadata, index metadata, file resources, snippet resources with bounded reads
  • Prompts: semantic grep helper, symbol tracing helper

Crucially, searches return resource links, not raw file dumps. Agents must explicitly fetch snippets or files, and those reads are bounded.
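
As a hedged illustration of that pattern, here is roughly what returning resource links can look like with the MCP Python SDK's FastMCP helper. The semchan:// URI scheme, the stub helpers, and the byte cap are made up for the example; this is not semantic-chan's actual implementation.

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("semchan-sketch")

    def run_vector_search(query: str, limit: int) -> list[dict]:
        # Stub standing in for the LanceDB query shown earlier.
        return [{"id": "demo", "file": "src/auth.py", "start": 10, "end": 42}]

    def load_snippet(snippet_id: str, max_bytes: int) -> str:
        # Stub: the real version reads only the chunk's line range from disk.
        return "snippet text placeholder"[:max_bytes]

    @mcp.tool()
    def semchan_search(query: str, limit: int = 10) -> list[dict]:
        """Return resource links, not file contents; the agent fetches what it needs."""
        hits = run_vector_search(query, limit)
        return [
            {"uri": f"semchan://snippet/{h['id']}",
             "file": h["file"], "lines": f"{h['start']}-{h['end']}"}
            for h in hits
        ]

    @mcp.resource("semchan://snippet/{snippet_id}")
    def snippet(snippet_id: str) -> str:
        """Bounded read: only the stored chunk, capped in bytes."""
        return load_snippet(snippet_id, max_bytes=16_384)

    if __name__ == "__main__":
        mcp.run()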


Safety choices

  • Path traversal is blocked.
  • MCP file reads are repo-relative only.
  • Non-code files are disallowed by default.
  • File reads are byte-limited (see the sketch after this list).
  • HTTP MCP transport requires a bearer token and origin checks.
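
A minimal sketch of what these guards can look like on the read path, assuming a repo root resolved once at startup; the byte limit and extension allowlist below are illustrative defaults, not semantic-chan's.

    from pathlib import Path

    REPO_ROOT = Path(".").resolve()        # computed once at startup
    MAX_READ_BYTES = 65_536                # illustrative cap
    CODE_EXTENSIONS = {".py", ".ts", ".go", ".rs", ".java", ".c", ".h"}

    def safe_read(rel_path: str) -> str:
        target = (REPO_ROOT / rel_path).resolve()
        # Repo-relative only: resolving symlinks and ".." must not escape the root.
        if not target.is_relative_to(REPO_ROOT):
            raise ValueError("path escapes the repository")
        if target.suffix not in CODE_EXTENSIONS:
            raise ValueError("non-code files are not served by default")
        # Byte-limited read keeps agent-initiated fetches bounded.
        with open(target, "rb") as fh:
            return fh.read(MAX_READ_BYTES).decode("utf-8", errors="replace")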

Limitations (for now)

  • Deleted files are not yet pruned from the index.
  • A fresh process can leave stale chunks due to a table-opening edge case.
  • Meta file placement differs slightly between CLI and MCP modes.
  • Chunking is intentionally naive.

Takeaway

Semantic-chan is not trying to replace grep. It answers a different class of “where is this idea implemented?” questions, which makes it genuinely useful for audits and exploratory work.
