Postgres to ClickHouse, then the embeddings pipeline became the real migration
tl;dr summary
A Postgres-to-ClickHouse backfill surfaced a directional RabbitMQ backlog. The fix was to make the embedding worker behave like a GPU service (microbatching broker, single-lane execution, TensorRT + persistent caches), which pushed the bottleneck from inference to ClickHouse inserts.
A Postgres to ClickHouse migration was supposed to be about storage, compression, and better analytics.
Instead, the backfill surfaced a much more expensive problem: the embeddings pipeline.
RabbitMQ queue depth climbed steadily, acks per second stayed low, and the GPU never looked as busy as it should have been. The backlog was not spiky. It was directional, which usually means the system is consistently slower at consuming than producing.
Snapshot
- Rows migrated: ~4.80M
- Token count: 109,743,122 (o200k_base)
- Text length: avg 81.22 chars (min 1, max 30,228)
- GPU: RTX 4090 (24 GB)
Why ClickHouse (one paragraph)
ClickHouse fits a text lake well: zstd compression makes storage predictable, analytic queries are fast, and you get a clean path to vector search (cosine similarity via dot product on normalized embeddings). It also keeps operational knobs (partitioning, tiered storage, cold offload) close to the data.
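To make that vector-search path concrete, here is a minimal sketch of a similarity lookup, assuming a clickhouse-connect client, the embeddings table described later in this post, a ClickHouse version whose dotProduct accepts Array(Float32) arguments, and server-side parameter binding. Table and column names are illustrative, not the production schema.

```python
# Hypothetical similarity lookup against the embeddings table.
import clickhouse_connect
from FlagEmbedding import BGEM3FlagModel

client = clickhouse_connect.get_client(host="localhost")
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def top_k(query_text: str, k: int = 10):
    # Assuming normalized dense vectors, dot product == cosine similarity.
    qvec = model.encode([query_text])["dense_vecs"][0].tolist()
    sql = """
        SELECT doc_uuid,
               dotProduct(embedding, {q:Array(Float32)}) AS score
        FROM embeddings
        WHERE model = 'bge-m3'
        ORDER BY score DESC
        LIMIT {k:UInt32}
    """
    return client.query(sql, parameters={"q": qvec, "k": k}).result_rows
```

Because the stored vectors are normalized, ordering by dot product gives the same ranking as ordering by cosine similarity.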
Migration strategy
The migration itself was intentionally boring: backfill then switch.
- pause scrapers and ingestion
- backfill the existing Postgres corpus into ClickHouse
- rewrite scrapers to write into ClickHouse directly
- resume ingestion after cutover
That avoids dual-write complexity and keeps correctness simple.
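A minimal sketch of step two (the backfill), assuming psycopg2 on the Postgres side and clickhouse-connect on the ClickHouse side; table names, column names, and the batch size are illustrative.

```python
# Stream rows out of Postgres with a server-side cursor and insert into
# ClickHouse in large batches instead of row by row.
import psycopg2
import clickhouse_connect

BATCH = 50_000

pg = psycopg2.connect("dbname=corpus user=app")
ch = clickhouse_connect.get_client(host="localhost")

COLS = ["doc_uuid", "created_at", "text", "metadata"]

with pg.cursor(name="backfill") as cur:   # named cursor: no full result fetch
    cur.itersize = BATCH
    cur.execute("SELECT doc_uuid, created_at, text, metadata FROM documents")
    batch = []
    for row in cur:
        batch.append(row)
        if len(batch) >= BATCH:
            ch.insert("docs", batch, column_names=COLS)
            batch.clear()
    if batch:
        ch.insert("docs", batch, column_names=COLS)
```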
Schema at a glance
The lake is basically two ClickHouse MergeTree tables: one for documents, one for embeddings.
- docs: text + metadata, partitioned by month, ordered to make versions and timelines easy
- embeddings: Array(Float32) vectors, partitioned by (model, month) and ordered by (model, doc_uuid)
Compression is tuned per column: text and metadata use lighter zstd, vectors use stronger zstd.
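As a rough sketch, the two tables could look like the DDL below. The exact columns, ORDER BY keys, and zstd levels are assumptions, not the production schema.

```python
# Create the two MergeTree tables described above (illustrative DDL).
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")

ch.command("""
CREATE TABLE IF NOT EXISTS docs (
    doc_uuid   UUID,
    version    UInt32,
    created_at DateTime,
    text       String CODEC(ZSTD(3)),    -- lighter zstd for text
    metadata   String CODEC(ZSTD(3))     -- lighter zstd for metadata
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (doc_uuid, version, created_at)
""")

ch.command("""
CREATE TABLE IF NOT EXISTS embeddings (
    model      LowCardinality(String),
    doc_uuid   UUID,
    created_at DateTime,
    embedding  Array(Float32) CODEC(ZSTD(9))   -- stronger zstd for vectors
)
ENGINE = MergeTree
PARTITION BY (model, toYYYYMM(created_at))
ORDER BY (model, doc_uuid)
""")
```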
A quality tradeoff (for throughput)
During the backfill, I did not chunk long documents. Raw text is stored in ClickHouse, but embeddings are produced with a configured max_length (1024 in this setup). That means very long text gets truncated for the embedding representation.
It is not the final word on retrieval quality, but it kept the pipeline simple and predictable while chasing throughput.
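In FlagEmbedding terms, the tradeoff is just the max_length argument to encode(): tokens beyond it never reach the model, while the full text still lives in ClickHouse. A minimal sketch:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

texts = ["short doc", "a very long document " * 5000]

out = model.encode(
    texts,
    batch_size=12,      # matches the broker's max batch size
    max_length=1024,    # tokens past this point are truncated for the embedding
)
dense = out["dense_vecs"]   # shape: (len(texts), embedding_dim)
```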
The hidden cost: embedding at scale
At 109M tokens, hosted embeddings are no longer “a quick API call”. API pricing becomes large enough to matter, and if your embedding service is slow, backlog becomes an operational problem.
Where the backlog showed up
The embedding worker consumes RabbitMQ messages, embeds text with BAAI/bge-m3 (FlagEmbedding), then writes results. During backfill, the to_embed_text queue (docs + embeddings) climbed into the millions.
The important takeaway is simple: the worker was not keeping up.
Root cause (the GPU never saw real batches)
The original execution model was correct but not GPU-friendly:
- each message handler embedded one text at a time
- it then did storage writes
- only after the writes did it ack and move on
Even if you add handler concurrency, that pattern often still produces tiny, uncoordinated inference calls. The GPU pays fixed overhead per call, and it never gets enough work in-flight to form meaningful batches.
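For reference, the original execution model looked roughly like the consumer below: a sketch with pika and a placeholder write function, not the actual worker code.

```python
# One message -> one encode() call -> storage writes -> ack.
# Correct, but the GPU only ever sees batch size 1.
import json
import pika
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def write_results(doc, vec):
    """Placeholder for the ClickHouse doc + embedding writes."""
    ...

def on_message(ch, method, properties, body):
    doc = json.loads(body)
    # One text per call: fixed per-call overhead, no batching.
    vec = model.encode([doc["text"]], max_length=1024)["dense_vecs"][0]
    write_results(doc, vec)                           # writes block the handler
    ch.basic_ack(delivery_tag=method.delivery_tag)    # ack only after the writes

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.basic_consume(queue="to_embed_text", on_message_callback=on_message)
channel.start_consuming()
```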
What changed (the optimization set)
The fixes were a combination of queueing mechanics, inference mechanics, and cache mechanics:
- Centralize inference behind a microbatching broker. Handlers submit requests to a broker that waits up to 10 ms to collect multiple texts, then runs a single batched encode() and returns per-request results (max batch size 12). A minimal sketch follows this list.
- Single-lane model execution. All model calls run through one execution lane on purpose. It reduces contention and makes batching predictable.
- In-flight de-dupe and an optional LRU cache. Duplicate inputs share the same in-flight future (about 5% of the content was duplicate), and recently seen texts can be served without recompute.
- Precision and attention tuning. A small win, but worthwhile once the basics are fixed (bf16 helped on this hardware).
- TensorRT enablement. This was the real accelerator. The service attempts torch.compile with the TensorRT backend, falls back to regular Torch on failure, and persists compile artifacts so restarts do not pay the full warmup cost (also sketched below).
- Persist caches at the compose level. Torch inductor cache, torch extensions cache, HuggingFace cache, and TensorRT cache are mounted so the worker restarts like a service, not like a build job. The TensorRT warmup was about 15 seconds, and the cached artifacts were a few GB.
- Avoid slow config copying. Config snapshots were previously deep-copied via JSON round-trips. Switching to copy.deepcopy removed overhead from a hot path.
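Here is a minimal asyncio sketch of the broker and the single execution lane, with in-flight de-dupe. The real worker's wiring differs and the optional LRU cache is omitted; the 10 ms window and batch size 12 are the values described above.

```python
# Handlers await submit(); one worker lane drains the queue, waits up to
# 10 ms to fill a batch of up to 12 texts, runs one encode() call, and
# resolves per-request futures. Duplicate in-flight texts share one future.
import asyncio

MAX_BATCH = 12
MAX_WAIT_S = 0.010

class EmbeddingBroker:
    def __init__(self, model):
        self.model = model
        self.queue: asyncio.Queue = asyncio.Queue()
        self.inflight: dict[str, asyncio.Future] = {}   # in-flight de-dupe

    async def submit(self, text: str):
        if text in self.inflight:                       # duplicate shares the future
            return await self.inflight[text]
        fut = asyncio.get_running_loop().create_future()
        self.inflight[text] = fut
        await self.queue.put(text)
        return await fut

    async def run(self):
        """Single execution lane: the only place encode() is ever called."""
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One batched inference call for the whole microbatch.
            vecs = await asyncio.to_thread(
                lambda: self.model.encode(batch, batch_size=MAX_BATCH,
                                          max_length=1024)["dense_vecs"]
            )
            for text, vec in zip(batch, vecs):
                self.inflight.pop(text).set_result(vec)

# usage (hypothetical):
#   model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
#   broker = EmbeddingBroker(model)
#   asyncio.create_task(broker.run())
#   vec = await broker.submit("some text")
```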
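And a sketch of the TensorRT enablement with fallback, plus the cache locations that get mounted as volumes. The backend name ("tensorrt", registered by torch_tensorrt) and the paths are assumptions about this setup, not the service's exact code.

```python
# Try torch.compile with the TensorRT backend; fall back to the plain model
# if anything goes wrong. Pointing the standard cache env vars at mounted
# volumes is what lets restarts skip most of the warmup.
import os
import logging
import torch

os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/caches/inductor")
os.environ.setdefault("TORCH_EXTENSIONS_DIR", "/caches/torch_extensions")
os.environ.setdefault("HF_HOME", "/caches/huggingface")

def compile_model(model: torch.nn.Module) -> torch.nn.Module:
    try:
        import torch_tensorrt  # noqa: F401  (registers the "tensorrt" backend)
        compiled = torch.compile(model, backend="tensorrt")
        logging.info("TensorRT backend enabled")
        return compiled
    except Exception as exc:              # missing TRT, unsupported op, etc.
        logging.warning("TensorRT compile failed, using eager Torch: %s", exc)
        return model
```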
Results (inference-only)
After the optimization set, inference timings are captured in the chart below.
Once inference became fast and consistent, the pipeline finally behaved like it had a GPU.
The next bottleneck
After inference stopped dominating, the limiting factor moved to ClickHouse insert latency.
That shift is a win. It means the system is no longer model-bound. It is IO-paced, which is usually easier to attack with batching, fewer round trips, and careful concurrency.
One subtle feedback loop shows up here: slow inserts mean handlers spend more time blocked on IO, which reduces parallel demand, which reduces how many requests land inside the microbatch window. When the pipeline is insert-paced, batch sizes can collapse back toward 1 unless you keep enough messages in flight.
Practical next steps
If the goal is to drain multi-million message backlogs faster, the next work is mostly around write amplification:
- reduce round trips (stage docs and embeddings more efficiently; see the sketch after this list)
- increase concurrency carefully while keeping in-flight memory bounded
- confirm RabbitMQ delivery settings so the broker has enough parallel demand to form real batches
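A sketch of the first and third items, assuming pika and clickhouse-connect; names, the prefetch value, and the flush threshold are illustrative.

```python
# Two levers in one place: enough parallel demand (prefetch) so the 10 ms
# microbatch window actually fills, and fewer write round trips by buffering
# rows and inserting in batches instead of once per message.
import pika
import clickhouse_connect

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.basic_qos(prefetch_count=256)   # keep enough unacked messages in flight

ch = clickhouse_connect.get_client(host="localhost")

class EmbeddingBuffer:
    """Accumulate embedding rows and flush them in one insert."""
    def __init__(self, flush_at: int = 5_000):
        self.rows, self.flush_at = [], flush_at

    def add(self, model: str, doc_uuid: str, created_at, vec: list[float]):
        self.rows.append((model, doc_uuid, created_at, vec))
        if len(self.rows) >= self.flush_at:
            self.flush()

    def flush(self):
        # Acks for buffered messages should be deferred until this succeeds.
        if self.rows:
            ch.insert("embeddings", self.rows,
                      column_names=["model", "doc_uuid", "created_at", "embedding"])
            self.rows.clear()
```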
Closing
This started as a storage migration, but the real lesson was infrastructure: at scale, an embedding pipeline is not just a model call. It is a queueing system, a batching system, a compilation system, and a storage system.