Postgres to ClickHouse, then the embeddings pipeline became the real migration
tl;dr summary
A Postgres-to-ClickHouse backfill surfaced a directional RabbitMQ backlog. The fix was to make the embedding worker behave like a GPU service (microbatching broker, single-lane execution, TensorRT + persistent caches), which pushed the bottleneck from inference to ClickHouse inserts.
A Postgres to ClickHouse migration was supposed to be about storage, compression, and better analytics.
Instead, the backfill surfaced a much more expensive problem: the embeddings pipeline.
RabbitMQ queue depth climbed steadily, acks per second stayed low, and the GPU never looked as busy as it should have been. The backlog was not spiky. It was directional, which usually means the system is consistently slower at consuming than producing.
Snapshot
- Rows migrated: ~4.80M
- Token count: 109,743,122 (o200k_base)
- Text length: avg 81.22 chars (min 1, max 30,228)
- GPU: RTX 4090 (24 GB)
Why ClickHouse (one paragraph)
ClickHouse fits a text lake well: zstd compression makes storage predictable, analytic queries are fast, and you get a clean path to vector search (cosine similarity via dot product on normalized embeddings). It also keeps operational knobs (partitioning, tiered storage, cold offload) close to the data.
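To make that vector-search path concrete, here is a minimal sketch of a similarity lookup, assuming a clickhouse-connect client, the embeddings table described later in this post, a ClickHouse version whose dotProduct accepts Array(Float32) arguments, and server-side parameter binding. Table and column names are illustrative, not the production schema.

```python
# Hypothetical similarity lookup against the embeddings table.
import clickhouse_connect
from FlagEmbedding import BGEM3FlagModel

client = clickhouse_connect.get_client(host="localhost")
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def top_k(query_text: str, k: int = 10):
    # Assuming normalized dense vectors, dot product == cosine similarity.
    qvec = model.encode([query_text])["dense_vecs"][0].tolist()
    sql = """
        SELECT doc_uuid,
               dotProduct(embedding, {q:Array(Float32)}) AS score
        FROM embeddings
        WHERE model = 'bge-m3'
        ORDER BY score DESC
        LIMIT {k:UInt32}
    """
    return client.query(sql, parameters={"q": qvec, "k": k}).result_rows
```

Because the stored vectors are normalized, ordering by dot product gives the same ranking as ordering by cosine similarity.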
Migration strategy
The migration itself was intentionally boring: backfill then switch.
- pause scrapers and ingestion
- backfill the existing Postgres corpus into ClickHouse
- rewrite scrapers to write into ClickHouse directly
- resume ingestion after cutover
That avoids dual-write complexity and keeps correctness simple.
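A minimal sketch of step two (the backfill), assuming psycopg2 on the Postgres side and clickhouse-connect on the ClickHouse side; table names, column names, and the batch size are illustrative.

```python
# Stream rows out of Postgres with a server-side cursor and insert into
# ClickHouse in large batches instead of row by row.
import psycopg2
import clickhouse_connect

BATCH = 50_000

pg = psycopg2.connect("dbname=corpus user=app")
ch = clickhouse_connect.get_client(host="localhost")

COLS = ["doc_uuid", "created_at", "text", "metadata"]

with pg.cursor(name="backfill") as cur:   # named cursor: no full result fetch
    cur.itersize = BATCH
    cur.execute("SELECT doc_uuid, created_at, text, metadata FROM documents")
    batch = []
    for row in cur:
        batch.append(row)
        if len(batch) >= BATCH:
            ch.insert("docs", batch, column_names=COLS)
            batch.clear()
    if batch:
        ch.insert("docs", batch, column_names=COLS)
```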
Schema at a glance
The lake is basically two ClickHouse MergeTree tables: one for documents, one for embeddings.
- docs: text + metadata, partitioned by month, ordered to make versions and timelines easy
- embeddings: Array(Float32) vectors, partitioned by (model, month) and ordered by (model, doc_uuid)
Compression is tuned per column: text and metadata use lighter zstd, vectors use stronger zstd.
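As a rough sketch, the two tables could look like the DDL below. The exact columns, ORDER BY keys, and zstd levels are assumptions, not the production schema.

```python
# Create the two MergeTree tables described above (illustrative DDL).
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")

ch.command("""
CREATE TABLE IF NOT EXISTS docs (
    doc_uuid   UUID,
    version    UInt32,
    created_at DateTime,
    text       String CODEC(ZSTD(3)),    -- lighter zstd for text
    metadata   String CODEC(ZSTD(3))     -- lighter zstd for metadata
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (doc_uuid, version, created_at)
""")

ch.command("""
CREATE TABLE IF NOT EXISTS embeddings (
    model      LowCardinality(String),
    doc_uuid   UUID,
    created_at DateTime,
    embedding  Array(Float32) CODEC(ZSTD(9))   -- stronger zstd for vectors
)
ENGINE = MergeTree
PARTITION BY (model, toYYYYMM(created_at))
ORDER BY (model, doc_uuid)
""")
```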
A quality tradeoff (for throughput)
During the backfill, I did not chunk long documents. Raw text is stored in ClickHouse, but embeddings are produced with a configured max_length (1024 in this setup). That means very long text gets truncated for the embedding representation.
It is not the final word on retrieval quality, but it kept the pipeline simple and predictable while chasing throughput.
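In FlagEmbedding terms, the tradeoff is just the max_length argument to encode(): tokens beyond it never reach the model, while the full text still lives in ClickHouse. A minimal sketch:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

texts = ["short doc", "a very long document " * 5000]

out = model.encode(
    texts,
    batch_size=12,      # matches the broker's max batch size
    max_length=1024,    # tokens past this point are truncated for the embedding
)
dense = out["dense_vecs"]   # shape: (len(texts), embedding_dim)
```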
The hidden cost: embedding at scale
At 109M tokens, hosted embeddings are no longer “a quick API call”. API pricing becomes large enough to matter, and if your embedding service is slow, backlog becomes an operational problem.
Where the backlog showed up
The embedding worker consumes RabbitMQ messages, embeds text with BAAI/bge-m3 (FlagEmbedding), then writes results. During backfill, the to_embed_text queue (docs + embeddings) climbed into the millions.
The important takeaway is simple: the worker was not keeping up.
Root cause (the GPU never saw real batches)
The original execution model was correct but not GPU-friendly:
- each message handler embedded one text at a time
- it then did storage writes
- only after the writes did it ack and move on
Even if you add handler concurrency, that pattern often still produces tiny, uncoordinated inference calls. The GPU pays fixed overhead per call, and it never gets enough work in-flight to form meaningful batches.
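For reference, the original execution model looked roughly like the consumer below: a sketch with pika and a placeholder write function, not the actual worker code.

```python
# One message -> one encode() call -> storage writes -> ack.
# Correct, but the GPU only ever sees batch size 1.
import json
import pika
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def write_results(doc, vec):
    """Placeholder for the ClickHouse doc + embedding writes."""
    ...

def on_message(ch, method, properties, body):
    doc = json.loads(body)
    # One text per call: fixed per-call overhead, no batching.
    vec = model.encode([doc["text"]], max_length=1024)["dense_vecs"][0]
    write_results(doc, vec)                           # writes block the handler
    ch.basic_ack(delivery_tag=method.delivery_tag)    # ack only after the writes

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.basic_consume(queue="to_embed_text", on_message_callback=on_message)
channel.start_consuming()
```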
What changed (the optimization set)
The fixes were a combination of queueing mechanics, inference mechanics, and cache mechanics:
- Centralize inference behind a microbatching broker. Handlers submit requests to a broker that waits up to 10 ms to collect multiple texts, then runs a single batched encode() and returns per-request results (max batch size 12). A minimal sketch follows this list.
- Single-lane model execution. All model calls run through one execution lane on purpose. It reduces contention and makes batching predictable.
- In-flight de-dupe and an optional LRU cache. Duplicate inputs share the same in-flight future (about 5% of the content was duplicate), and recently seen texts can be served without recompute.
- Precision and attention tuning. A small win, but worthwhile once the basics are fixed (bf16 helped on this hardware).
- TensorRT enablement. This was the real accelerator. The service attempts torch.compile with the TensorRT backend, falls back to regular Torch on failure, and persists compile artifacts so restarts do not pay the full warmup cost (also sketched below).
- Persist caches at the compose level. Torch inductor cache, torch extensions cache, HuggingFace cache, and TensorRT cache are mounted so the worker restarts like a service, not like a build job. The TensorRT warmup was about 15 seconds, and the cached artifacts were a few GB.
- Avoid slow config copying. Config snapshots were previously deep-copied via JSON round-trips. Switching to copy.deepcopy removed overhead from a hot path.
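Here is a minimal asyncio sketch of the broker and the single execution lane, with in-flight de-dupe. The real worker's wiring differs and the optional LRU cache is omitted; the 10 ms window and batch size 12 are the values described above.

```python
# Handlers await submit(); one worker lane drains the queue, waits up to
# 10 ms to fill a batch of up to 12 texts, runs one encode() call, and
# resolves per-request futures. Duplicate in-flight texts share one future.
import asyncio

MAX_BATCH = 12
MAX_WAIT_S = 0.010

class EmbeddingBroker:
    def __init__(self, model):
        self.model = model
        self.queue: asyncio.Queue = asyncio.Queue()
        self.inflight: dict[str, asyncio.Future] = {}   # in-flight de-dupe

    async def submit(self, text: str):
        if text in self.inflight:                       # duplicate shares the future
            return await self.inflight[text]
        fut = asyncio.get_running_loop().create_future()
        self.inflight[text] = fut
        await self.queue.put(text)
        return await fut

    async def run(self):
        """Single execution lane: the only place encode() is ever called."""
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One batched inference call for the whole microbatch.
            vecs = await asyncio.to_thread(
                lambda: self.model.encode(batch, batch_size=MAX_BATCH,
                                          max_length=1024)["dense_vecs"]
            )
            for text, vec in zip(batch, vecs):
                self.inflight.pop(text).set_result(vec)

# usage (hypothetical):
#   model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
#   broker = EmbeddingBroker(model)
#   asyncio.create_task(broker.run())
#   vec = await broker.submit("some text")
```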
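And a sketch of the TensorRT enablement with fallback, plus the cache locations that get mounted as volumes. The backend name ("tensorrt", registered by torch_tensorrt) and the paths are assumptions about this setup, not the service's exact code.

```python
# Try torch.compile with the TensorRT backend; fall back to the plain model
# if anything goes wrong. Pointing the standard cache env vars at mounted
# volumes is what lets restarts skip most of the warmup.
import os
import logging
import torch

os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/caches/inductor")
os.environ.setdefault("TORCH_EXTENSIONS_DIR", "/caches/torch_extensions")
os.environ.setdefault("HF_HOME", "/caches/huggingface")

def compile_model(model: torch.nn.Module) -> torch.nn.Module:
    try:
        import torch_tensorrt  # noqa: F401  (registers the "tensorrt" backend)
        compiled = torch.compile(model, backend="tensorrt")
        logging.info("TensorRT backend enabled")
        return compiled
    except Exception as exc:              # missing TRT, unsupported op, etc.
        logging.warning("TensorRT compile failed, using eager Torch: %s", exc)
        return model
```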
Results (inference-only)
After the optimization set, inference timings are captured in the chart below.
Once inference became fast and consistent, the pipeline finally behaved like it had a GPU.
The next bottleneck
After inference stopped dominating, the limiting factor moved to ClickHouse insert latency.
That shift is a win. It means the system is no longer model-bound. It is IO-paced, which is usually easier to attack with batching, fewer round trips, and careful concurrency.
One subtle feedback loop shows up here: slow inserts mean handlers spend more time blocked on IO, which reduces parallel demand, which reduces how many requests land inside the microbatch window. When the pipeline is insert-paced, batch sizes can collapse back toward 1 unless you keep enough messages in flight.
Practical next steps
If the goal is to drain multi-million message backlogs faster, the next work is mostly around write amplification:
- reduce round trips (stage docs and embeddings more efficiently; see the sketch after this list)
- increase concurrency carefully while keeping in-flight memory bounded
- confirm RabbitMQ delivery settings so the broker has enough parallel demand to form real batches
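A sketch of the first and third items, assuming pika and clickhouse-connect; names, the prefetch value, and the flush threshold are illustrative.

```python
# Two levers in one place: enough parallel demand (prefetch) so the 10 ms
# microbatch window actually fills, and fewer write round trips by buffering
# rows and inserting in batches instead of once per message.
import pika
import clickhouse_connect

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.basic_qos(prefetch_count=256)   # keep enough unacked messages in flight

ch = clickhouse_connect.get_client(host="localhost")

class EmbeddingBuffer:
    """Accumulate embedding rows and flush them in one insert."""
    def __init__(self, flush_at: int = 5_000):
        self.rows, self.flush_at = [], flush_at

    def add(self, model: str, doc_uuid: str, created_at, vec: list[float]):
        self.rows.append((model, doc_uuid, created_at, vec))
        if len(self.rows) >= self.flush_at:
            self.flush()

    def flush(self):
        # Acks for buffered messages should be deferred until this succeeds.
        if self.rows:
            ch.insert("embeddings", self.rows,
                      column_names=["model", "doc_uuid", "created_at", "embedding"])
            self.rows.clear()
```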
Closing
This started as a storage migration, but the real lesson was infrastructure: at scale, an embedding pipeline is not just a model call. It is a queueing system, a batching system, a compilation system, and a storage system.