BLOG_POST / postgres-to-clickhouse-then-embeddings-pipeline

Postgres to ClickHouse, then the embeddings pipeline became the real migration

3 min read
460 words
tl;dr summary

A Postgres-to-ClickHouse backfill surfaced a directional RabbitMQ backlog. The fix was to make the embedding worker behave like a GPU service (microbatching broker, single-lane execution, TensorRT + persistent caches) - which pushed the bottleneck from inference to ClickHouse inserts.

A Postgres to ClickHouse migration is usually a storage story: compression, partitions, fast analytics.

This one turned into an embeddings pipeline story.

ClickHouse still did what it was supposed to do: make a text lake cheap to store and easy to query. The surprise was that the backfill turned embeddings into the pacing item, not the database.

At this snapshot the corpus was 109,743,122 tokens (o200k_base), which is large enough that both cost and throughput matter. For simplicity during backfill, embeddings were generated with max_length 1024 and no chunking, which trades some retrieval quality for predictable throughput.
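
For reference, a count like that can be taken with tiktoken's o200k_base encoding; the helper below is only a sketch, and the corpus iterator is hypothetical.

    import tiktoken

    # o200k_base is the encoding named in the snapshot above
    enc = tiktoken.get_encoding("o200k_base")

    def corpus_token_count(texts):
        # texts: any iterable of document strings streamed out of the source tables
        return sum(len(enc.encode(text)) for text in texts)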

During the ClickHouse backfill, a Grafana panel showed RabbitMQ queue depth climbing steadily while acks per second stayed low. The GPU box (RTX 4090) never looked as busy as it should have. The curve was directional, not spiky, which is what it looks like when a system is under-capacity all the time.

The root cause was not mysterious. The worker was doing inference like a CPU service:

  • one message handler embedded exactly one text
  • it then did storage writes
  • only then did it ack and fetch the next message

That pattern feeds the GPU a stream of tiny, uncoordinated calls. The fixed overhead (Python, scheduling, kernel launches) dominates, and batching never happens naturally.
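
In code, the anti-pattern looks roughly like the sketch below, assuming pika; embed(), insert_rows(), and the embed_jobs queue name are placeholders, not the actual worker.

    import json
    import pika

    def embed(text):
        ...  # placeholder: one model forward pass for a single text

    def insert_rows(rows):
        ...  # placeholder: one storage write for a single row

    def handle(ch, method, properties, body):
        doc = json.loads(body)
        vector = embed(doc["text"])                      # tiny, uncoordinated GPU call
        insert_rows([(doc["id"], vector)])               # tiny storage write
        ch.basic_ack(delivery_tag=method.delivery_tag)   # only now does the next message arrive

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="embed_jobs", durable=True)
    channel.basic_qos(prefetch_count=1)                  # keeps everything strictly serial
    channel.basic_consume(queue="embed_jobs", on_message_callback=handle)
    channel.start_consuming()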

The fix was to treat inference like a shared resource:

  • centralize encode() behind a microbatching broker (wait ~10 ms to form small batches), as sketched after this list
  • route all model execution through a single lane to avoid contention and make batching predictable
  • add in-flight de-dupe and an optional cache so duplicates do not burn GPU time
  • enable TensorRT via torch.compile, and persist compile artifacts and model caches so restarts stay fast
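
A minimal sketch of that broker idea, assuming an asyncio worker and a SentenceTransformer-style model.encode(list_of_texts) call; the class name, batch size, and wait window are illustrative, not the actual implementation.

    import asyncio
    from typing import Dict, List, Tuple

    class EmbeddingBroker:
        def __init__(self, model, max_batch: int = 32, max_wait_s: float = 0.010):
            self.model = model
            self.max_batch = max_batch
            self.max_wait_s = max_wait_s
            self.queue: asyncio.Queue = asyncio.Queue()
            self.inflight: Dict[str, asyncio.Future] = {}  # de-dupe identical in-flight texts

        async def encode(self, text: str):
            if text in self.inflight:                      # duplicate rides along for free
                return await self.inflight[text]
            fut = asyncio.get_running_loop().create_future()
            self.inflight[text] = fut
            await self.queue.put((text, fut))
            return await fut

        async def run(self) -> None:
            # Single lane: only this task touches the model, so batching stays predictable.
            while True:
                text, fut = await self.queue.get()
                batch: List[Tuple[str, asyncio.Future]] = [(text, fut)]
                deadline = asyncio.get_running_loop().time() + self.max_wait_s
                while len(batch) < self.max_batch:
                    timeout = deadline - asyncio.get_running_loop().time()
                    if timeout <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
                texts = [t for t, _ in batch]
                # One GPU call for the whole batch; to_thread keeps the event loop responsive.
                vectors = await asyncio.to_thread(self.model.encode, texts)
                for (t, f), vec in zip(batch, vectors):
                    f.set_result(vec)
                    self.inflight.pop(t, None)

The run() task is started once at worker startup; every message handler then awaits encode(text), and the ~10 ms wait is what turns many concurrent requests into one forward pass.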

A small but important point: microbatching only helps if there is enough parallel demand. If inserts or ack behavior keep the pipeline serial, batch size falls back to 1 and the GPU goes idle again.
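
Concretely, with a client like aio-pika that means raising the prefetch (QoS) limit so many messages are unacked at once; the handler below is a placeholder.

    import asyncio
    import aio_pika

    async def process(message: aio_pika.IncomingMessage) -> None:
        ...  # placeholder: await broker.encode(...), buffer the row, then ack
        await message.ack()

    async def main() -> None:
        connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
        channel = await connection.channel()
        await channel.set_qos(prefetch_count=64)   # prefetch_count=1 keeps the pipeline serial
        queue = await channel.declare_queue("embed_jobs", durable=True)
        async with queue.iterator() as messages:
            async for message in messages:
                # Handle messages concurrently so the broker actually sees parallel demand.
                asyncio.create_task(process(message))

    asyncio.run(main())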

After that, inference stopped being the long pole. BAAI/bge-m3 with TensorRT settled into consistent low-latency behavior (mean ~11.65 ms, p95 ~15.7 ms per request), and the bottleneck moved to ClickHouse inserts.
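
For the compile step from the fix list, here is a hedged sketch of what the setup can look like with sentence-transformers and torch_tensorrt's torch.compile backend; option names and cache handling vary by version, so treat it as illustrative rather than the exact configuration used here.

    import os
    import torch
    import torch_tensorrt  # importing this registers the "torch_tensorrt" backend for torch.compile
    from sentence_transformers import SentenceTransformer

    # Persistent model/download cache so restarts do not re-fetch weights.
    os.environ.setdefault("HF_HOME", "/var/cache/huggingface")

    model = SentenceTransformer("BAAI/bge-m3", device="cuda")
    model.max_seq_length = 1024  # matches the backfill setting: max_length 1024, no chunking

    # Route the underlying transformer through the TensorRT backend of torch.compile.
    # Recent torch_tensorrt releases can also persist built engines; that configuration
    # is version-specific, so it is omitted here.
    model[0].auto_model = torch.compile(
        model[0].auto_model,
        backend="torch_tensorrt",
        options={"enabled_precisions": {torch.float16}},
    )

    model.encode(["warm-up request"])  # the first call pays the compilation cost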

That is a better place to be, but it comes with the next set of work: reduce round trips, avoid tiny inserts, and keep enough messages in flight so the broker can actually form batches.
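
A minimal sketch of the insert-batching direction, assuming clickhouse-connect; the table name, column names, and flush thresholds are illustrative.

    import time
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    MAX_ROWS, MAX_AGE_S = 5000, 2.0   # flush on size or age, whichever comes first
    buffer = []
    last_flush = time.monotonic()

    def add_row(doc_id, vector):
        global last_flush
        buffer.append((doc_id, vector))
        if len(buffer) >= MAX_ROWS or time.monotonic() - last_flush >= MAX_AGE_S:
            # One round trip per batch instead of one per embedding.
            client.insert("docs_embeddings", buffer, column_names=["id", "embedding"])
            buffer.clear()
            last_flush = time.monotonic()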

The lesson is that “embeddings at scale” is rarely just the model. It is the queueing, batching, compilation, and storage around it - and once those parts behave like a system, the fixes become straightforward.

hash: 127c
EOF