
Why I built a semantic grep

tl;dr summary

While auditing large codebases where I didn’t know what to search for, I built a semantic grep that indexes git-tracked files into line-based chunks, embeds them via an OpenAI-compatible API, searches them with LanceDB, and exposes results through grep-style output and MCP resources instead of raw file dumps.

Semantic-chan: grep grew a brain (and learned MCP)

When I audit unfamiliar codebases, grep only gets me so far. It is great when I already know the identifiers I need. It is much less helpful when the real questions are conceptual:

  • Where is auth actually implemented?
  • Where is input validated?
  • Where do we rate limit?

That gap is why I built semantic-chan: a semantic grep that stays local, keeps outputs grep-shaped, and is safe enough to expose to agents via MCP.


The pipeline

The workflow is intentionally boring:

  1. Use git ls-files so indexing matches .gitignore reality.
  2. Chunk files into overlapping line-based segments.
  3. Embed each chunk using an OpenAI-compatible /v1/embeddings endpoint.
  4. Store vectors + metadata locally in LanceDB.
  5. Embed the query, run vector search, then format results as grep/JSON/summary or MCP resources.

No ASTs. No LSP integration. Just vectors, line ranges, and predictable behavior.
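
To make those steps concrete, here is a minimal Python sketch of the indexing side, assuming a local LanceDB directory and any OpenAI-compatible embeddings server. The chunk size, overlap, endpoint URL, and model name are illustrative placeholders rather than semantic-chan's actual defaults, and batching plus extension filtering are omitted for brevity.

    import subprocess
    import lancedb
    import requests

    EMBED_URL = "http://localhost:8080/v1/embeddings"  # any OpenAI-compatible server
    EMBED_MODEL = "embedding-model"                    # placeholder model name
    CHUNK_LINES, OVERLAP = 40, 10                      # line-based chunks with overlap

    def tracked_files():
        # git ls-files keeps the index aligned with .gitignore reality
        out = subprocess.run(["git", "ls-files"], capture_output=True, text=True, check=True)
        return out.stdout.splitlines()

    def chunk_file(path):
        lines = open(path, encoding="utf-8", errors="replace").read().splitlines()
        if not lines:
            return
        step = CHUNK_LINES - OVERLAP
        for start in range(0, len(lines), step):
            end = min(start + CHUNK_LINES, len(lines))
            yield {"file": path, "start": start + 1, "end": end,
                   "text": "\n".join(lines[start:end])}
            if end == len(lines):
                break

    def embed(texts):
        # one request per call; a real indexer would batch and retry
        resp = requests.post(EMBED_URL, json={"model": EMBED_MODEL, "input": texts})
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    def build_index(db_path=".semchan"):
        records = [c for path in tracked_files() for c in chunk_file(path)]
        for record, vector in zip(records, embed([r["text"] for r in records])):
            record["vector"] = vector
        db = lancedb.connect(db_path)
        db.create_table("chunks", data=records, mode="overwrite")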


What makes it usable in practice

Indexing and querying have a few “small but important” details:

  • Noise control: skip heavy directories (node_modules, vendor, dist, build) and only index code-like extensions by default.
  • Incremental updates: a meta.json tracks SHA1 hashes so unchanged files are skipped.
  • Filters: glob / case-insensitive glob, file type filters, and optional path prefixes.
  • Oversampling: if you filter, it pulls more candidates first so you do not accidentally filter everything out (see the query sketch after this list).
  • Threshold relaxation: if nothing matches, it relaxes the distance cutoff once (within bounds) and tells you.
  • Snippet merging + context: adjacent chunks merge into one snippet; optional context lines come from disk, grep-style.
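
Here is a rough sketch of how the query path can combine those heuristics, reusing the embed() helper and lancedb import from the indexing sketch above. The oversampling factor and distance cutoff are invented defaults for illustration, not semantic-chan's real numbers, and snippet merging is left out.

    def search(query, db_path=".semchan", limit=10, file_glob=None, max_dist=0.6):
        table = lancedb.connect(db_path).open_table("chunks")
        query_vec = embed([query])[0]

        # Oversample when a filter is active so post-filtering still leaves results.
        fetch = limit * 5 if file_glob else limit
        q = table.search(query_vec).metric("cosine").limit(fetch)
        if file_glob:
            # crude glob -> SQL LIKE translation, good enough for a sketch
            q = q.where("file LIKE '{}'".format(file_glob.replace("*", "%")))
        hits = q.to_list()

        matches = [h for h in hits if h["_distance"] <= max_dist]
        if not matches and hits:
            # Relax the cutoff once, within a bound, and say so.
            relaxed = min(max_dist * 1.5, 1.0)
            matches = [h for h in hits if h["_distance"] <= relaxed]
            print(f"# no hits under {max_dist}; relaxed cutoff to {relaxed:.2f}")

        for hit in matches[:limit]:
            # grep-style: path:start-end: first line of the matching chunk
            first_line = hit["text"].splitlines()[0] if hit["text"] else ""
            print(f'{hit["file"]}:{hit["start"]}-{hit["end"]}: {first_line}')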

MCP integration (and why it matters)

The MCP server is the real reason semantic-chan exists.

It exposes:

  • Tools: semchan_search, semchan_read_file, semchan_index, semchan_index_status
  • Resources: repo metadata, index metadata, file resources, snippet resources with bounded reads
  • Prompts: semantic grep helper, symbol tracing helper

Crucially, searches return resource links, not raw file dumps. Agents must explicitly fetch snippets or files, and those reads are bounded.
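
As a hedged illustration of that pattern, here is roughly what returning resource links can look like with the MCP Python SDK's FastMCP helper. The semchan:// URI scheme, the stub helpers, and the byte cap are made up for the example; this is not semantic-chan's actual implementation.

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("semchan-sketch")

    def run_vector_search(query: str, limit: int) -> list[dict]:
        # Stub standing in for the LanceDB query shown earlier.
        return [{"id": "demo", "file": "src/auth.py", "start": 10, "end": 42}]

    def load_snippet(snippet_id: str, max_bytes: int) -> str:
        # Stub: the real version reads only the chunk's line range from disk.
        return "snippet text placeholder"[:max_bytes]

    @mcp.tool()
    def semchan_search(query: str, limit: int = 10) -> list[dict]:
        """Return resource links, not file contents; the agent fetches what it needs."""
        hits = run_vector_search(query, limit)
        return [
            {"uri": f"semchan://snippet/{h['id']}",
             "file": h["file"], "lines": f"{h['start']}-{h['end']}"}
            for h in hits
        ]

    @mcp.resource("semchan://snippet/{snippet_id}")
    def snippet(snippet_id: str) -> str:
        """Bounded read: only the stored chunk, capped in bytes."""
        return load_snippet(snippet_id, max_bytes=16_384)

    if __name__ == "__main__":
        mcp.run()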


Safety choices

  • Path traversal is blocked.
  • MCP file reads are repo-relative only.
  • Non-code files are disallowed by default.
  • File reads are byte-limited (see the sketch after this list).
  • HTTP MCP transport requires a bearer token and origin checks.
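
A minimal sketch of what these guards can look like on the read path, assuming a repo root resolved once at startup; the byte limit and extension allowlist below are illustrative defaults, not semantic-chan's.

    from pathlib import Path

    REPO_ROOT = Path(".").resolve()        # computed once at startup
    MAX_READ_BYTES = 65_536                # illustrative cap
    CODE_EXTENSIONS = {".py", ".ts", ".go", ".rs", ".java", ".c", ".h"}

    def safe_read(rel_path: str) -> str:
        target = (REPO_ROOT / rel_path).resolve()
        # Repo-relative only: resolving symlinks and ".." must not escape the root.
        if not target.is_relative_to(REPO_ROOT):
            raise ValueError("path escapes the repository")
        if target.suffix not in CODE_EXTENSIONS:
            raise ValueError("non-code files are not served by default")
        # Byte-limited read keeps agent-initiated fetches bounded.
        with open(target, "rb") as fh:
            return fh.read(MAX_READ_BYTES).decode("utf-8", errors="replace")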

Limitations (for now)

  • Deleted files are not yet pruned from the index.
  • A fresh process can leave stale chunks due to a table-opening edge case.
  • Meta file placement differs slightly between CLI and MCP modes.
  • Chunking is intentionally naive.

Takeaway

Semantic-chan is not trying to replace grep. It answers a different class of “where is this idea implemented?” questions, which makes it genuinely useful for audits and exploratory work.
