Why I built a semantic grep
tl;dr
While auditing large codebases where I didn’t know what to search for, I built a semantic grep that indexes git-tracked files into line-based chunks, embeds them via an OpenAI-compatible API, searches them with LanceDB, and exposes results through grep-style output and MCP resources instead of raw file dumps.
Semantic-chan: grep grew a brain (and learned MCP)
I spend a non-trivial amount of my time auditing large, messy codebases. The kind where “grep” technically works, but only if you already know what you are looking for. When you do not, you end up grepping strings, guessing names, and slowly losing faith in humanity.
So I built semantic-chan.
This post is a technical deep dive into what it is, how it works, and why I think it is a genuinely useful tool rather than yet another AI-flavored wrapper.
The problem I wanted to solve
Classic grep is deterministic and fast, but brutally literal. It answers questions like:
- “Where is this exact string?”
- “Where is this symbol name used?”
It does not answer:
- “Where is authentication logic implemented?”
- “Where do we validate user input?”
- “Where is rate limiting handled?”
When auditing unfamiliar codebases, those are the questions that matter.
Semantic-chan is my answer to that gap: a semantic grep that stays local, lightweight, scriptable, and safe enough to expose to tools like Codex or other MCP-capable agents.
What semantic-chan actually does
At a high level, the pipeline is simple and deliberate:
- Walk a repository, respecting `.gitignore` and skipping garbage directories.
- Chunk code files into overlapping line-based segments.
- Embed those chunks using an OpenAI-compatible `/v1/embeddings` endpoint.
- Store everything locally in LanceDB.
- Query by embedding a natural language prompt and running vector search.
- Present results as grep-style output, JSON, summaries, or MCP resources.
No ASTs. No language servers. No magic. Just vectors, files, and pragmatism.
Architecture overview
The codebase is split cleanly into responsibilities:
- fsutil: repo root detection, gitignore integration, file walking, hashing.
- config: XDG-compliant config and cache handling, safe defaults.
- embedder: an OpenAI-compatible embeddings client (tested with llama-server, but endpoint-agnostic).
- indexer: chunking, incremental indexing, metadata tracking.
- store: LanceDB schema creation, inserts, deletes, vector search.
- search: filtering, snippet merging, context expansion, output formatting.
- cli: Cobra-based CLI with grep-like ergonomics.
- server: minimal HTTP wrapper.
- mcpserver: full MCP server with tools, resources, prompts, and safety checks.
This separation is intentional. Each layer can be reasoned about independently, which matters when you are debugging something at 2am during an audit.
Indexing: boring on purpose
Indexing is where many tools overcomplicate things. Semantic-chan does not.
File discovery that matches reality
- Uses `git ls-files` to respect `.gitignore`.
- Skips heavy directories like `node_modules`, `vendor`, `dist`, `build`.
- Only indexes known code-like extensions by default.
This immediately cuts noise and prevents indexing secrets by accident.
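The extension allowlist is the simplest of these filters. A minimal sketch in Go, where the set of extensions and the function name are illustrative, not semantic-chan's actual defaults:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// codeExts is an illustrative allowlist; the tool's real default set
// of "code-like" extensions may differ.
var codeExts = map[string]bool{
	".go": true, ".py": true, ".ts": true, ".rs": true, ".c": true, ".h": true,
}

// isIndexable reports whether a path passes the default extension filter.
func isIndexable(p string) bool {
	return codeExts[strings.ToLower(filepath.Ext(p))]
}

func main() {
	fmt.Println(isIndexable("src/auth/login.go")) // true
	fmt.Println(isIndexable(".env"))              // false: secrets never enter the index
}
```

An allowlist, rather than a blocklist, is what keeps `.env` files and random binaries out of the embedding store by default.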
Chunking strategy
Files are chunked by line count, with configurable overlap. Each chunk stores:
- File path
- Line range
- Chunk index
- Raw content
- Embedding vector
Line-based chunking is not fancy, but it is stable, predictable, and works across languages. That is exactly what I want during audits.
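The whole strategy fits in a few lines of Go. This is a sketch under my own naming, not the indexer's actual API; the `Chunk` fields mirror the metadata listed above:

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk carries the per-chunk metadata described above (the embedding
// vector is filled in later by the embedder).
type Chunk struct {
	Path      string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Index     int
	Content   string
}

// chunkLines splits a file into chunks of `size` lines, stepping
// forward by size-overlap lines so consecutive chunks share context.
func chunkLines(path, content string, size, overlap int) []Chunk {
	lines := strings.Split(content, "\n")
	step := size - overlap
	if step < 1 {
		step = 1
	}
	var chunks []Chunk
	for start, idx := 0, 0; start < len(lines); start, idx = start+step, idx+1 {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, Chunk{
			Path:      path,
			StartLine: start + 1,
			EndLine:   end,
			Index:     idx,
			Content:   strings.Join(lines[start:end], "\n"),
		})
		if end == len(lines) {
			break
		}
	}
	return chunks
}

func main() {
	for _, c := range chunkLines("main.go", "a\nb\nc\nd\ne", 3, 1) {
		fmt.Printf("%s:%d-%d chunk %d\n", c.Path, c.StartLine, c.EndLine, c.Index)
	}
}
```

The overlap is what lets a function that straddles a chunk boundary still show up whole in at least one chunk.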
Incremental indexing
A `meta.json` file tracks SHA-1 hashes of file contents. Unchanged files are skipped entirely.
This keeps re-indexing fast enough to run often, without pretending to be clever.
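The check itself is trivial. A sketch, with a plain map standing in for the parsed `meta.json` (function and field names are mine):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// needsReindex hashes the file's current bytes and compares against
// the hash recorded at the last index run. It returns the new hash so
// the caller can update meta.json after a successful re-embed.
func needsReindex(meta map[string]string, path string, content []byte) (bool, string) {
	sum := sha1.Sum(content)
	h := hex.EncodeToString(sum[:])
	return meta[path] != h, h
}

func main() {
	meta := map[string]string{} // empty meta: first run, everything is new
	dirty, h := needsReindex(meta, "main.go", []byte("package main\n"))
	fmt.Println(dirty) // true
	meta["main.go"] = h
	dirty, _ = needsReindex(meta, "main.go", []byte("package main\n"))
	fmt.Println(dirty) // false: unchanged file, skip embedding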
Querying: where it gets interesting
Semantic search with guardrails
A query is embedded once, then used for vector search in LanceDB. Results are post-processed with:
- Glob and case-insensitive glob filters
- File type filters
- Optional path prefixes
If filters are used, semantic-chan automatically oversamples results to avoid empty outputs. This small detail dramatically improves real-world usability.
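The oversampling trick is easy to show. A sketch, where the ×5 factor, the `fetch` callback standing in for the LanceDB query, and all names are my assumptions:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Result stands in for one row returned by the vector store.
type Result struct {
	Path     string
	Distance float64
}

// searchWithFilters asks the store for more candidates than the caller
// wants whenever a filter is active, then filters and truncates, so
// post-filtering rarely comes back empty.
func searchWithFilters(fetch func(k int) []Result, k int, glob string) []Result {
	limit := k
	if glob != "" {
		limit = k * 5 // oversample: many candidates will be filtered away
	}
	var out []Result
	for _, r := range fetch(limit) {
		if glob != "" {
			if ok, _ := path.Match(glob, strings.ToLower(r.Path)); !ok {
				continue
			}
		}
		out = append(out, r)
		if len(out) == k {
			break
		}
	}
	return out
}

func main() {
	fetch := func(k int) []Result {
		return []Result{
			{"auth/login.go", 0.1}, {"docs/readme.md", 0.2}, {"auth/token.go", 0.3},
		}
	}
	for _, r := range searchWithFilters(fetch, 2, "auth/*.go") {
		fmt.Println(r.Path)
	}
}
```

Without oversampling, asking for the top 2 and then filtering by glob would frequently return one result or none.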
Adaptive threshold relaxation
Semantic search is noisy by nature. Hard cutoffs are brittle.
Semantic-chan starts with a reasonable distance threshold. If nothing matches, it relaxes the threshold once, within bounds, and tells you it did so.
This avoids the classic “AI tool returned nothing, shrug” experience.
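In code, the relaxation is a single bounded retry. A sketch with illustrative names and thresholds, not semantic-chan's actual values:

```go
package main

import "fmt"

// Result stands in for one vector-search hit and its distance score
// (lower distance means closer to the query).
type Result struct {
	Path     string
	Distance float64
}

// searchWithRelaxation filters at the strict threshold first; if
// nothing survives, it retries once at a bounded looser threshold and
// flags that it did so, so the caller can tell the user.
func searchWithRelaxation(results []Result, threshold, maxThreshold float64) ([]Result, bool) {
	within := func(t float64) []Result {
		var out []Result
		for _, r := range results {
			if r.Distance <= t {
				out = append(out, r)
			}
		}
		return out
	}
	if hits := within(threshold); len(hits) > 0 {
		return hits, false
	}
	return within(maxThreshold), true // relaxed exactly once, never unbounded
}

func main() {
	results := []Result{{"auth.go", 0.62}, {"token.go", 0.71}}
	hits, relaxed := searchWithRelaxation(results, 0.5, 0.75)
	fmt.Println(len(hits), relaxed) // hits appear only after relaxing once
}
```

Relaxing once, within a hard cap, keeps the failure mode honest: the user sees results plus a note, not silence and not garbage.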
Snippet merging and context expansion
Adjacent chunks from the same file are merged into coherent snippets. Optional context lines are pulled directly from disk, grep-style.
The output looks familiar, which matters when you are trying to reason about code quickly.
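Merging is a classic interval-coalescing pass. A sketch, assuming my own `Hit` type; the real merger lives in the search layer:

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one matched chunk: a file path plus a 1-based line range.
type Hit struct {
	Path       string
	Start, End int
}

// mergeAdjacent coalesces hits from the same file whose line ranges
// overlap or touch into one snippet, so overlapping chunks don't print
// the same lines twice.
func mergeAdjacent(hits []Hit) []Hit {
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Path != hits[j].Path {
			return hits[i].Path < hits[j].Path
		}
		return hits[i].Start < hits[j].Start
	})
	var out []Hit
	for _, h := range hits {
		if n := len(out); n > 0 && out[n-1].Path == h.Path && h.Start <= out[n-1].End+1 {
			if h.End > out[n-1].End {
				out[n-1].End = h.End // extend the previous snippet
			}
			continue
		}
		out = append(out, h)
	}
	return out
}

func main() {
	merged := mergeAdjacent([]Hit{
		{"auth.go", 40, 60}, {"auth.go", 50, 70}, {"auth.go", 120, 140},
	})
	for _, h := range merged {
		fmt.Printf("%s:%d-%d\n", h.Path, h.Start, h.End) // two snippets, not three
	}
}
```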
Interfaces: CLI, HTTP, and MCP
CLI: grep, but less angry
The CLI is intentionally ergonomic:
- Running `semchan "query"` just works.
- Output modes: pretty, grep-style, summary, JSON.
- Flags like `--files-only`, `--count`, `--unique-files` make it scriptable.
- Color is automatic, but controllable.
It feels like a tool you can actually adopt, not a demo.
HTTP server: minimal and composable
There is a tiny HTTP API:
- `/health`
- `/search`
It is not trying to replicate the CLI. It exists to glue semantic-chan into other systems when needed.
MCP server: the real reason this exists
The MCP server is where semantic-chan becomes more than a CLI.
It exposes:
- Tools
  - `semchan_search`
  - `semchan_read_file`
  - `semchan_index`
  - `semchan_index_status`
- Resources
  - Repository metadata
  - Index metadata
  - File resources
  - Snippet resources with bounded reads
- Prompts
  - Semantic grep helper
  - Symbol tracing helper
Search tools return resource links, not raw file dumps. Agents must explicitly fetch snippets or files. This is not accidental.
Security and safety choices
This is the part most hobby projects ignore. I did not.
- Path traversal is explicitly blocked.
- MCP file reads are repo-relative only.
- Non-code files are disallowed by default.
- File reads are byte-limited.
- HTTP MCP transport requires a bearer token and origin checks.
Semantic-chan is designed to be usable by LLMs without blindly handing them your entire filesystem.
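The traversal check is the piece worth spelling out, because it is the one that fails silently when done wrong. A sketch of the repo-relative resolution, with illustrative names and error messages:

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveRepoPath rejects absolute paths, cleans the joined path, and
// refuses anything that escapes the repository root, so "../" tricks
// and symlink-style prefixes in the request cannot reach outside.
func resolveRepoPath(root, rel string) (string, error) {
	if filepath.IsAbs(rel) {
		return "", errors.New("absolute paths are not allowed")
	}
	full := filepath.Clean(filepath.Join(root, rel))
	if full != root && !strings.HasPrefix(full, root+string(filepath.Separator)) {
		return "", errors.New("path escapes repository root")
	}
	return full, nil
}

func main() {
	if _, err := resolveRepoPath("/repo", "../etc/passwd"); err != nil {
		fmt.Println("blocked:", err)
	}
	p, _ := resolveRepoPath("/repo", "src/main.go")
	fmt.Println(p) // /repo/src/main.go
}
```

Note the prefix check uses `root + separator`, not `root` alone, so a sibling directory like `/repo-secrets` cannot sneak past as a "prefix" of the root.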
Known limitations and future improvements
This is not a sales pitch. There are things to improve.
- Deleted files are not yet pruned from the index, which is a real gap.
- The first update in a fresh process can leave stale chunks due to a table-opening edge case.
- Meta file placement differs slightly between CLI and MCP modes.
- Chunking is intentionally naive.
None of these undermine the core idea, and all of them are fixable without bloating the project.
Why I actually use this
Semantic-chan is not trying to replace grep, ripgrep, or language servers.
It answers a different class of questions:
- “Where does this app enforce permissions?”
- “Where is input sanitized?”
- “Where is this concept implemented, even if names differ?”
That makes it invaluable during audits, reviews, and exploratory work. The fact that it plugs cleanly into MCP means it also boosts tools like Codex without handing them the keys to the kingdom.
It is small, local, and understandable. Which, in 2025, feels almost rebellious.