Why I built a semantic grep
tl;dr
While auditing large codebases where I didn’t know what to search for, I built a semantic grep that indexes git-tracked files into line-based chunks, embeds them via an OpenAI-compatible API, searches them with LanceDB, and exposes results through grep-style output and MCP resources instead of raw file dumps.
Semantic-chan: grep grew a brain (and learned MCP)
I spend a non-trivial amount of my time auditing large, messy codebases. The kind where “grep” technically works, but only if you already know what you are looking for. When you do not, you end up grepping strings, guessing names, and slowly losing faith in humanity.
So I built semantic-chan.
This post is a technical deep dive into what it is, how it works, and why I think it is a genuinely useful tool rather than yet another AI-flavored wrapper.
The problem I wanted to solve
Classic grep is deterministic and fast, but brutally literal. It answers questions like:
- “Where is this exact string?”
- “Where is this symbol name used?”
It does not answer:
- “Where is authentication logic implemented?”
- “Where do we validate user input?”
- “Where is rate limiting handled?”
When auditing unfamiliar codebases, those are the questions that matter.
Semantic-chan is my answer to that gap: a semantic grep that stays local, lightweight, scriptable, and safe enough to expose to tools like Codex or other MCP-capable agents.
What semantic-chan actually does
At a high level, the pipeline is simple and deliberate:
- Walk a repository, respecting `.gitignore` and skipping garbage directories.
- Chunk code files into overlapping line-based segments.
- Embed those chunks using an OpenAI-compatible `/v1/embeddings` endpoint.
- Store everything locally in LanceDB.
- Query by embedding a natural language prompt and running vector search.
- Present results as grep-style output, JSON, summaries, or MCP resources.
No ASTs. No language servers. No magic. Just vectors, files, and pragmatism.
Architecture overview
The codebase is split cleanly into responsibilities:
- fsutil: repo root detection, gitignore integration, file walking, hashing.
- config: XDG-compliant config and cache handling, safe defaults.
- embedder: an OpenAI-compatible embeddings client (tested with llama-server, but endpoint-agnostic).
- indexer: chunking, incremental indexing, metadata tracking.
- store: LanceDB schema creation, inserts, deletes, vector search.
- search: filtering, snippet merging, context expansion, output formatting.
- cli: Cobra-based CLI with grep-like ergonomics.
- server: minimal HTTP wrapper.
- mcpserver: full MCP server with tools, resources, prompts, and safety checks.
This separation is intentional. Each layer can be reasoned about independently, which matters when you are debugging something at 2am during an audit.
Indexing: boring on purpose
Indexing is where many tools overcomplicate things. Semantic-chan does not.
File discovery that matches reality
- Uses `git ls-files` to respect `.gitignore`.
- Skips heavy directories like `node_modules`, `vendor`, `dist`, `build`.
- Only indexes known code-like extensions by default.
This immediately cuts noise and prevents indexing secrets by accident.
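The extension allowlist is the simplest of these filters. A minimal sketch in Go, where the set of extensions and the function name are illustrative, not semantic-chan's actual defaults:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// codeExts is an illustrative allowlist; the tool's real default set
// of "code-like" extensions may differ.
var codeExts = map[string]bool{
	".go": true, ".py": true, ".ts": true, ".rs": true, ".c": true, ".h": true,
}

// isIndexable reports whether a path passes the default extension filter.
func isIndexable(p string) bool {
	return codeExts[strings.ToLower(filepath.Ext(p))]
}

func main() {
	fmt.Println(isIndexable("src/auth/login.go")) // true
	fmt.Println(isIndexable(".env"))              // false: secrets never enter the index
}
```

An allowlist, rather than a blocklist, is what keeps `.env` files and random binaries out of the embedding store by default.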
Chunking strategy
Files are chunked by line count, with configurable overlap. Each chunk stores:
- File path
- Line range
- Chunk index
- Raw content
- Embedding vector
Line-based chunking is not fancy, but it is stable, predictable, and works across languages. That is exactly what I want during audits.
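The whole strategy fits in a few lines of Go. This is a sketch under my own naming, not the indexer's actual API; the `Chunk` fields mirror the metadata listed above:

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk carries the per-chunk metadata described above (the embedding
// vector is filled in later by the embedder).
type Chunk struct {
	Path      string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Index     int
	Content   string
}

// chunkLines splits a file into chunks of `size` lines, stepping
// forward by size-overlap lines so consecutive chunks share context.
func chunkLines(path, content string, size, overlap int) []Chunk {
	lines := strings.Split(content, "\n")
	step := size - overlap
	if step < 1 {
		step = 1
	}
	var chunks []Chunk
	for start, idx := 0, 0; start < len(lines); start, idx = start+step, idx+1 {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, Chunk{
			Path:      path,
			StartLine: start + 1,
			EndLine:   end,
			Index:     idx,
			Content:   strings.Join(lines[start:end], "\n"),
		})
		if end == len(lines) {
			break
		}
	}
	return chunks
}

func main() {
	for _, c := range chunkLines("main.go", "a\nb\nc\nd\ne", 3, 1) {
		fmt.Printf("%s:%d-%d chunk %d\n", c.Path, c.StartLine, c.EndLine, c.Index)
	}
}
```

The overlap is what lets a function that straddles a chunk boundary still show up whole in at least one chunk.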
Incremental indexing
A `meta.json` file tracks SHA-1 hashes of file contents. Unchanged files are skipped entirely.
This keeps re-indexing fast enough to run often, without pretending to be clever.
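The check itself is trivial. A sketch, with a plain map standing in for the parsed `meta.json` (function and field names are mine):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// needsReindex hashes the file's current bytes and compares against
// the hash recorded at the last index run. It returns the new hash so
// the caller can update meta.json after a successful re-embed.
func needsReindex(meta map[string]string, path string, content []byte) (bool, string) {
	sum := sha1.Sum(content)
	h := hex.EncodeToString(sum[:])
	return meta[path] != h, h
}

func main() {
	meta := map[string]string{} // empty meta: first run, everything is new
	dirty, h := needsReindex(meta, "main.go", []byte("package main\n"))
	fmt.Println(dirty) // true
	meta["main.go"] = h
	dirty, _ = needsReindex(meta, "main.go", []byte("package main\n"))
	fmt.Println(dirty) // false: unchanged file, skip embedding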
Querying: where it gets interesting
Semantic search with guardrails
A query is embedded once, then used for vector search in LanceDB. Results are post-processed with:
- Glob and case-insensitive glob filters
- File type filters
- Optional path prefixes
If filters are used, semantic-chan automatically oversamples results to avoid empty outputs. This small detail dramatically improves real-world usability.
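The oversampling trick is easy to show. A sketch, where the ×5 factor, the `fetch` callback standing in for the LanceDB query, and all names are my assumptions:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Result stands in for one row returned by the vector store.
type Result struct {
	Path     string
	Distance float64
}

// searchWithFilters asks the store for more candidates than the caller
// wants whenever a filter is active, then filters and truncates, so
// post-filtering rarely comes back empty.
func searchWithFilters(fetch func(k int) []Result, k int, glob string) []Result {
	limit := k
	if glob != "" {
		limit = k * 5 // oversample: many candidates will be filtered away
	}
	var out []Result
	for _, r := range fetch(limit) {
		if glob != "" {
			if ok, _ := path.Match(glob, strings.ToLower(r.Path)); !ok {
				continue
			}
		}
		out = append(out, r)
		if len(out) == k {
			break
		}
	}
	return out
}

func main() {
	fetch := func(k int) []Result {
		return []Result{
			{"auth/login.go", 0.1}, {"docs/readme.md", 0.2}, {"auth/token.go", 0.3},
		}
	}
	for _, r := range searchWithFilters(fetch, 2, "auth/*.go") {
		fmt.Println(r.Path)
	}
}
```

Without oversampling, asking for the top 2 and then filtering by glob would frequently return one result or none.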
Adaptive threshold relaxation
Semantic search is noisy by nature. Hard cutoffs are brittle.
Semantic-chan starts with a reasonable distance threshold. If nothing matches, it relaxes the threshold once, within bounds, and tells you it did so.
This avoids the classic “AI tool returned nothing, shrug” experience.
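In code, the relaxation is a single bounded retry. A sketch with illustrative names and thresholds, not semantic-chan's actual values:

```go
package main

import "fmt"

// Result stands in for one vector-search hit and its distance score
// (lower distance means closer to the query).
type Result struct {
	Path     string
	Distance float64
}

// searchWithRelaxation filters at the strict threshold first; if
// nothing survives, it retries once at a bounded looser threshold and
// flags that it did so, so the caller can tell the user.
func searchWithRelaxation(results []Result, threshold, maxThreshold float64) ([]Result, bool) {
	within := func(t float64) []Result {
		var out []Result
		for _, r := range results {
			if r.Distance <= t {
				out = append(out, r)
			}
		}
		return out
	}
	if hits := within(threshold); len(hits) > 0 {
		return hits, false
	}
	return within(maxThreshold), true // relaxed exactly once, never unbounded
}

func main() {
	results := []Result{{"auth.go", 0.62}, {"token.go", 0.71}}
	hits, relaxed := searchWithRelaxation(results, 0.5, 0.75)
	fmt.Println(len(hits), relaxed) // hits appear only after relaxing once
}
```

Relaxing once, within a hard cap, keeps the failure mode honest: the user sees results plus a note, not silence and not garbage.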
Snippet merging and context expansion
Adjacent chunks from the same file are merged into coherent snippets. Optional context lines are pulled directly from disk, grep-style.
The output looks familiar, which matters when you are trying to reason about code quickly.
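Merging is a classic interval-coalescing pass. A sketch, assuming my own `Hit` type; the real merger lives in the search layer:

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one matched chunk: a file path plus a 1-based line range.
type Hit struct {
	Path       string
	Start, End int
}

// mergeAdjacent coalesces hits from the same file whose line ranges
// overlap or touch into one snippet, so overlapping chunks don't print
// the same lines twice.
func mergeAdjacent(hits []Hit) []Hit {
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Path != hits[j].Path {
			return hits[i].Path < hits[j].Path
		}
		return hits[i].Start < hits[j].Start
	})
	var out []Hit
	for _, h := range hits {
		if n := len(out); n > 0 && out[n-1].Path == h.Path && h.Start <= out[n-1].End+1 {
			if h.End > out[n-1].End {
				out[n-1].End = h.End // extend the previous snippet
			}
			continue
		}
		out = append(out, h)
	}
	return out
}

func main() {
	merged := mergeAdjacent([]Hit{
		{"auth.go", 40, 60}, {"auth.go", 50, 70}, {"auth.go", 120, 140},
	})
	for _, h := range merged {
		fmt.Printf("%s:%d-%d\n", h.Path, h.Start, h.End) // two snippets, not three
	}
}
```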
Interfaces: CLI, HTTP, and MCP
CLI: grep, but less angry
The CLI is intentionally ergonomic:
- Running `semchan "query"` just works.
- Output modes: pretty, grep-style, summary, JSON.
- Flags like `--files-only`, `--count`, `--unique-files` make it scriptable.
- Color is automatic, but controllable.
It feels like a tool you can actually adopt, not a demo.
HTTP server: minimal and composable
There is a tiny HTTP API:
- `/health`
- `/search`
It is not trying to replicate the CLI. It exists to glue semantic-chan into other systems when needed.
MCP server: the real reason this exists
The MCP server is where semantic-chan becomes more than a CLI.
It exposes:
- Tools
  - `semchan_search`
  - `semchan_read_file`
  - `semchan_index`
  - `semchan_index_status`
- Resources
  - Repository metadata
  - Index metadata
  - File resources
  - Snippet resources with bounded reads
- Prompts
  - Semantic grep helper
  - Symbol tracing helper
Search tools return resource links, not raw file dumps. Agents must explicitly fetch snippets or files. This is not accidental.
Security and safety choices
This is the part most hobby projects ignore. I did not.
- Path traversal is explicitly blocked.
- MCP file reads are repo-relative only.
- Non-code files are disallowed by default.
- File reads are byte-limited.
- HTTP MCP transport requires a bearer token and origin checks.
Semantic-chan is designed to be usable by LLMs without blindly handing them your entire filesystem.
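The traversal check is the piece worth spelling out, because it is the one that fails silently when done wrong. A sketch of the repo-relative resolution, with illustrative names and error messages:

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveRepoPath rejects absolute paths, cleans the joined path, and
// refuses anything that escapes the repository root, so "../" tricks
// and symlink-style prefixes in the request cannot reach outside.
func resolveRepoPath(root, rel string) (string, error) {
	if filepath.IsAbs(rel) {
		return "", errors.New("absolute paths are not allowed")
	}
	full := filepath.Clean(filepath.Join(root, rel))
	if full != root && !strings.HasPrefix(full, root+string(filepath.Separator)) {
		return "", errors.New("path escapes repository root")
	}
	return full, nil
}

func main() {
	if _, err := resolveRepoPath("/repo", "../etc/passwd"); err != nil {
		fmt.Println("blocked:", err)
	}
	p, _ := resolveRepoPath("/repo", "src/main.go")
	fmt.Println(p) // /repo/src/main.go
}
```

Note the prefix check uses `root + separator`, not `root` alone, so a sibling directory like `/repo-secrets` cannot sneak past as a "prefix" of the root.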
Known limitations and future improvements
This is not a sales pitch. There are things to improve.
- Deleted files are not yet pruned from the index, which is a real gap.
- The first update in a fresh process can leave stale chunks due to a table-opening edge case.
- Meta file placement differs slightly between CLI and MCP modes.
- Chunking is intentionally naive.
None of these undermine the core idea, and all of them are fixable without bloating the project.
Why I actually use this
Semantic-chan is not trying to replace grep, ripgrep, or language servers.
It answers a different class of questions:
- “Where does this app enforce permissions?”
- “Where is input sanitized?”
- “Where is this concept implemented, even if names differ?”
That makes it invaluable during audits, reviews, and exploratory work. The fact that it plugs cleanly into MCP means it also boosts tools like Codex without handing them the keys to the kingdom.
It is small, local, and understandable. Which, in 2025, feels almost rebellious.