
Postgres to ClickHouse, then the embeddings pipeline became the real migration

tl;dr summary

A Postgres-to-ClickHouse backfill surfaced a directional RabbitMQ backlog. The fix was to make the embedding worker behave like a GPU service (microbatching broker, single-lane execution, TensorRT + persistent caches), which pushed the bottleneck from inference to ClickHouse inserts.

Context

This post started as a storage migration: port an old text-only dataset from Postgres to ClickHouse, then use ClickHouse for analytics and storage efficiency.

During the backfill, it turned into an inference optimization project.

A Grafana card showed RabbitMQ queue depth that kept climbing while acks per second stayed low. GPU utilization on a dedicated box did not match expectations. The backlog was not spiky. It was directional.

This write-up focuses on the embedding service: what it does, why it was slow, what changed, and why TensorRT plus batching ended up doing most of the heavy lifting.

Snapshot

  • Rows migrated: ~4.80M (text-only rows ported out of Postgres)
  • Token count: 109,743,122 (computed with o200k_base)
  • Text length: avg 81.22 characters (min 1, max 30,228)
  • GPU: RTX 4090 (24 GB VRAM, bare metal host, dockerized)

Why ClickHouse

ClickHouse was chosen for a set of practical reasons that fit a text lake:

  • zstd compression for predictable storage wins
  • fast analytics for KPIs and time series
  • a clean path for embedding search via cosine similarity (dot product on normalized vectors; see the query sketch after this list)
  • tiered storage and cold offload
  • JSON support for flexible metadata
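
To make the cosine-similarity bullet concrete, here is a minimal retrieval sketch. It assumes the clickhouse-connect client, vectors that were L2-normalized before insert (so dot product equals cosine similarity), and a ClickHouse version whose dotProduct accepts Array arguments; the connection details and the top_k helper are placeholders, not the project's actual code.

import clickhouse_connect

# Connection details are illustrative placeholders.
client = clickhouse_connect.get_client(host="localhost")

def top_k(query_vec: list[float], model: str = "BAAI/bge-m3", k: int = 10):
    # With normalized vectors, dotProduct is cosine similarity, so ordering
    # by it descending is a brute-force nearest-neighbor scan.
    result = client.query(
        """
        SELECT doc_uuid, dotProduct(embedding, %(q)s) AS score
        FROM lake.embeddings
        WHERE model = %(model)s
        ORDER BY score DESC
        LIMIT %(k)s
        """,
        parameters={"q": query_vec, "model": model, "k": k},
    )
    return result.result_rows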

Migration strategy

I used backfill then switch:

  • scrapers were stopped during migration
  • the existing Postgres corpus was backfilled into ClickHouse
  • scrapers were rewritten to write into ClickHouse directly
  • ingestion resumed after cutover

This avoided dual-write complexity and kept correctness simple while the storage layer changed.

The ClickHouse schema (DDL)

The lake has two core tables: one for documents, one for embeddings.

Design goals:

  • keep document metadata and raw text queryable and cheap to scan
  • keep embeddings append-friendly and partitioned by model and time
  • use LowCardinality where values repeat heavily
  • partition by month for operational manageability and cold storage policies
CREATE TABLE lake.docs
(
  source LowCardinality(String),
  source_id String,
  doc_uuid UUID,
  updated_at DateTime64(3),
  ingest_at DateTime64(3),
  author String,
  text String CODEC(ZSTD(3)),
  meta_json String CODEC(ZSTD(3))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ingest_at)
ORDER BY (source, source_id, updated_at)
SETTINGS index_granularity = 8192;

CREATE TABLE lake.embeddings
(
  model LowCardinality(String),
  doc_uuid UUID,
  updated_at DateTime64(3),
  embedding Array(Float32) CODEC(ZSTD(6))
)
ENGINE = MergeTree
PARTITION BY (model, toYYYYMM(updated_at))
ORDER BY (model, doc_uuid)
SETTINGS index_granularity = 8192;

Notes:

  • docs are partitioned by ingest month; ordering by (source, source_id, updated_at) makes it easy to query versions and timelines.
  • embeddings are partitioned by (model, month(updated_at)) to keep multi-model history clean, and ordered by (model, doc_uuid) to match the common join and retrieval shape.
  • embeddings are stored as Array(Float32) and compressed with a stronger zstd level than text, because vectors are large and compress well.
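
Because retrieval leans on dot product standing in for cosine similarity, vectors are normalized before they are written. A minimal sketch of the row shape, assuming numpy and the column order from the DDL above (embedding_row is an illustrative helper, not the service's actual code):

import uuid
from datetime import datetime

import numpy as np

def embedding_row(doc_uuid: uuid.UUID, updated_at: datetime, vector,
                  model: str = "BAAI/bge-m3"):
    # Cast to float32 to match Array(Float32), then L2-normalize so that
    # dotProduct in ClickHouse doubles as cosine similarity.
    vec = np.asarray(vector, dtype=np.float32)
    norm = float(np.linalg.norm(vec))
    if norm > 0.0:
        vec = vec / norm
    # Column order matches lake.embeddings: model, doc_uuid, updated_at, embedding.
    return (model, doc_uuid, updated_at, vec.tolist())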

Token footprint and the cost of outsourcing embeddings

The dataset currently contains 109,743,122 tokens as measured with o200k_base. That number keeps increasing with ongoing ingestion, but even at this snapshot, it is large enough that embedding costs are easy to underestimate.

Before going all in on self-hosted embeddings, I computed what it would cost to embed the entire dataset once using a few hosted embedding APIs (standard pricing, and batch pricing where available).
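
The arithmetic behind those estimates is simple enough to sketch. This assumes tiktoken for the o200k_base encoding; the per-million-token price below is a placeholder, not any specific provider's rate.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def embedding_cost(texts, usd_per_million_tokens: float) -> float:
    # Count tokens the same way the snapshot above was measured,
    # then multiply by a (hypothetical) per-million-token price.
    total_tokens = sum(len(enc.encode(t)) for t in texts)
    return total_tokens / 1_000_000 * usd_per_million_tokens

# The full dataset at a hypothetical $0.02 per 1M tokens:
# 109,743,122 / 1,000,000 * 0.02 ≈ $2.19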

Estimated cost to embed 109,743,122 tokens once (USD)
Estimated total API cost to embed the full dataset once. Batch pricing is shown where the provider publishes it.

This section is not here to argue that hosted APIs are bad. It is here to explain why a local embedding pipeline needs to be efficient: at this scale, it is very easy to turn embeddings into a recurring cost center, or into a backlog problem, or both.

Where embeddings fit in

The embedding worker is an async service that consumes RabbitMQ messages, embeds text with BAAI/bge-m3 (FlagEmbedding), then writes results to storage or triggers downstream work.

It handles two independent streams:

  • to_embed: unrelated to the ClickHouse migration; this text is destined for Elasticsearch (embedding happens here, indexing happens elsewhere).
  • to_embed_text: the migration path; every text document inserted into ClickHouse (lake.docs and lake.embeddings) flows through this queue.

Important quality note: I do not chunk long documents in this iteration. Raw text is stored in ClickHouse, but embeddings are produced with a configured max_length (1024 in this setup). That means long text is effectively truncated for the embedding representation. For backfill throughput and operational stability, it was an acceptable trade. If retrieval quality becomes the priority, chunking is the next obvious step.
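
For context, this is roughly the shape of the encode call, assuming FlagEmbedding's BGEM3FlagModel API; max_length=1024 is the setting that truncates long documents in this iteration, and the embed wrapper is illustrative.

from FlagEmbedding import BGEM3FlagModel

# Load BAAI/bge-m3 once at startup; fp16 keeps VRAM and latency down.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def embed(texts: list[str]) -> list[list[float]]:
    # Anything past max_length tokens is cut off, which is the quality
    # trade-off described above (no chunking in this iteration).
    out = model.encode(texts, batch_size=12, max_length=1024)
    return out["dense_vecs"].tolist()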

The signal that something was wrong

I noticed the issue because a Grafana panel showed queue depth that kept climbing while acks per second stayed low.

From the 1-minute export on 2025-12-26:

  • 14:35: first passed 10k (13,807)
  • 15:05: first passed 100k (100,485)
  • 16:14: first passed 500k (512,420)
  • 16:38: first passed 800k (808,778)
  • 17:08: first passed 1M (1,005,023)
  • 21:00: first passed 2M (2,009,989)
  • 21:53: first passed 2.5M (2,500,356)
  • 22:01: reached 2.58M (2,589,117)
  • 22:50: peak (2,944,392)

One additional detail that made the curve feel worse: the steepest 1-hour increase in the export was 619,317 messages (from 16:02 to 17:02).

RabbitMQ queue depth (to_embed_text) during backfill
5-minute sampling from a 1-minute Grafana export of RabbitMQ queue depth (to_embed_text), on 2025-12-26.

The important part is not the exact slope. It is what it implies: the system was consistently ingesting slower than it was producing messages.

The baseline problem (before optimization)

Old inference execution model

The original design was correct but not GPU-friendly:

  • each RabbitMQ message handler executed inference for exactly one text at a time
  • the handler then performed storage writes
  • only after the writes completed would it ack and move on

Concurrency existed at the message level, but batching did not exist at the inference level. Even with multiple handlers, inference calls were not coordinated, so the GPU rarely saw meaningful batches.

At the time I noticed the backlog, the runtime configuration was effectively serial:

  • concurrency was 1
  • RabbitMQ delivery was effectively one unacked message at a time

This created an unproductive cadence:

  • run a tiny encode call
  • block on database writes and network round trips
  • repeat

That pattern can underutilize even a very strong GPU.
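
A condensed sketch of that baseline cadence; the handler shape and the storage helpers are illustrative names, not the actual service code.

async def handle_message(message) -> None:
    # One message, one text, one encode call: the GPU never sees a batch.
    text = message.body.decode()
    vector = model.encode([text], max_length=1024)["dense_vecs"][0]

    # Two round trips to ClickHouse before the next message can start.
    await insert_doc(text)            # illustrative helper
    await insert_embedding(vector)    # illustrative helper

    await message.ack()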

Before diagram

RabbitMQ to_embed_text → Handler → model.encode (single item) → ClickHouse insert into lake.docs → ClickHouse insert into lake.embeddings → Ack

What changed (the optimization set)

This section reflects the concrete changes introduced in the embedding service, mostly in services/embedding/main.py, plus the container and compose changes that make TensorRT practical.

Optimizations

In short: this took a per-message, single-item encode() loop and turned it into something that behaves like a GPU service. Inference was centralized behind a microbatching broker, routed through a single execution lane, and protected with in-flight de-dupe plus an optional cache. Precision/attention settings were tightened up, and TensorRT (with persistent compile caches) did the real acceleration work. After that, inference stopped being the long pole, and the bottleneck moved to ClickHouse inserts, which is a much better problem to have.

1) Centralize inference with an InferenceBroker (microbatcher)

A new InferenceBroker was introduced to centralize inference requests and enable microbatching across handlers.

New behavior:

  • message handlers call a broker API (embed)
  • the broker collects requests for up to microbatch_max_wait_ms (10 ms)
  • it runs a single model.encode over the accumulated batch
  • it returns per-request results to awaiting handlers

Microbatch settings used here:

  • enabled: true
  • microbatch_max_wait_ms: 10
  • microbatch_max_batch_size: 12

Why it matters:

Transformer inference has fixed per-call overhead (kernel launches, scheduling, Python overhead). Batching amortizes those costs and improves GPU occupancy.

Important operational note:

Microbatching only helps if multiple requests arrive inside the batching window. If the pipeline is IO-bound or configured with very low concurrency, batch size will often collapse back to 1.
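
Here is a minimal sketch of the microbatching idea, assuming asyncio and a synchronous encode function; the class name and queueing details are illustrative rather than the service's actual InferenceBroker, and error handling is omitted.

import asyncio

class MicroBatcher:
    """Collects concurrent embed() calls into a single encode() call."""

    def __init__(self, encode_fn, max_wait_ms: int = 10, max_batch: int = 12):
        self.encode_fn = encode_fn   # e.g. lambda texts: model.encode(texts)["dense_vecs"]
        self.max_wait = max_wait_ms / 1000.0
        self.max_batch = max_batch
        self.pending: list[tuple[str, asyncio.Future]] = []
        self._timer: asyncio.Task | None = None

    async def embed(self, text: str):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((text, fut))
        if len(self.pending) >= self.max_batch:
            self._run_batch()                        # full batch: flush now
        elif self._timer is None:
            self._timer = asyncio.create_task(self._window())
        return await fut

    async def _window(self):
        await asyncio.sleep(self.max_wait)           # the 10 ms batching window
        self._timer = None
        self._run_batch()

    def _run_batch(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        # One encode call for everything that arrived inside the window.
        # (The real service hands this to a single execution lane; it is
        # called inline here only to keep the sketch short.)
        vectors = self.encode_fn([text for text, _ in batch])
        for (_, fut), vec in zip(batch, vectors):
            if not fut.done():
                fut.set_result(vec)

With these settings, up to twelve handlers that call embed within the same 10 ms window share one model.encode call instead of issuing twelve.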

2) Single-lane model execution (on purpose)

All model execution is routed through a single execution lane (one thread).

Why:

  • concurrent inference calls on the same model and GPU often cause contention and allocator pressure
  • a single lane makes latency more predictable and makes batching more effective

Tradeoff observed:

  • in isolation, single-lane inference was about 20% slower in raw microbenchmarks
  • in the end-to-end pipeline, it was still a net win once batching and TensorRT were in place
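
A sketch of how that lane can be wired, assuming asyncio and a one-thread executor; model stands for the BGEM3FlagModel loaded earlier and the function name is illustrative.

import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker thread == one execution lane for the model.
_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="model-lane")

async def encode_on_lane(texts: list[str]):
    # Every inference call is serialized through the same thread, so two
    # batches never fight over the GPU or the allocator at the same time.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _lane, lambda: model.encode(texts, max_length=1024)["dense_vecs"]
    )

In the full pipeline, the broker's flush step would await something like this instead of calling encode inline, which is how batching and single-lane execution compose.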

3) In-flight de-duplication and optional cache

The broker tracks in-flight requests keyed by model settings and normalized text. If the same text arrives while it is already being embedded, callers share the same future.

This matters because about 5% of the content is duplicate.

There is also an optional in-process LRU cache (keyed by model settings and normalized text). In this deployment, cache footprint is capped around 256 MB.
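
A sketch of the de-dupe mechanism, assuming requests are keyed by normalized text; the normalization, the plain-dict cache, and the embed_once wrapper are illustrative (the real cache is an LRU capped by size, and error propagation to co-waiters is omitted here).

import asyncio

_inflight: dict[str, asyncio.Future] = {}
_cache: dict[str, list[float]] = {}   # the real service caps this around 256 MB

async def embed_once(text: str, broker) -> list[float]:
    key = " ".join(text.split()).lower()          # illustrative normalization
    if key in _cache:
        return _cache[key]
    if key in _inflight:
        # Someone is already embedding this exact text: share their result.
        return await _inflight[key]
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        vec = await broker.embed(text)
        fut.set_result(vec)
        _cache[key] = vec
        return vec
    finally:
        _inflight.pop(key, None)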

4) Precision control and attention backend selection

A runtime config key selects inference precision (auto, fp16, bf16, fp32). On this hardware, bf16 delivered about a 5% speedup versus the previous precision behavior.

There is also a best-effort attention backend selector (SDPA variants and optional flash attention if installed). It helped, but it was not the dominant improvement in this setup.
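
A sketch of how a precision key can map onto a torch dtype and an autocast context; the config key, the pick_dtype helper, and the wrapper are illustrative, and the attention-backend selection is left out because it is best-effort and hardware-dependent.

import torch

_DTYPES = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}

def pick_dtype(precision: str = "auto") -> torch.dtype:
    if precision == "auto":
        # Prefer bf16 where the GPU supports it (it did on this 4090), else fp16.
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return _DTYPES[precision]

def encode_with_precision(texts, precision: str = "auto"):
    dtype = pick_dtype(precision)
    with torch.autocast(device_type="cuda", dtype=dtype, enabled=dtype != torch.float32):
        return model.encode(texts, max_length=1024)["dense_vecs"]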

5) TensorRT enablement (the main time saver)

TensorRT acceleration was the most meaningful improvement in steady-state inference.

Container and build changes:

  • base image moved to nvcr.io/nvidia/tensorrt:25.08-py3
  • torch-tensorrt is installed during build so it is always present
  • a build fix copies .python-version before uv sync so the interpreter choice is stable and the venv is not unexpectedly recreated

Runtime behavior:

  • config selects inference_backend: torch or tensorrt
  • when tensorrt is selected, the service attempts torch.compile with the TensorRT backend
  • if it fails, it logs and falls back to the torch path (non-fatal)
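
A sketch of that select-and-fall-back logic, assuming torch-tensorrt's torch.compile backend; how the underlying transformer module is reached inside FlagEmbedding is deployment-specific, so apply_backend takes a plain nn.Module.

import logging

import torch

log = logging.getLogger("embedding")

def apply_backend(module: torch.nn.Module, backend: str = "tensorrt") -> torch.nn.Module:
    if backend != "tensorrt":
        return module  # plain torch path
    try:
        import torch_tensorrt  # noqa: F401  (registers the compile backend)

        # Compilation happens lazily on the first forward pass, which is the
        # warmup cost noted below; compiled artifacts land in the mounted caches.
        return torch.compile(module, backend="torch_tensorrt")
    except Exception as exc:
        # Non-fatal: log and keep serving on the eager torch path.
        log.warning("TensorRT backend unavailable, falling back to torch: %s", exc)
        return module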

Warmup and caching:

  • TensorRT warmup time (compile) was about 15 seconds
  • compiled artifacts are persisted via mounted cache directories
  • cache size observed: about 4.3 GB

6) Persist compile caches at the compose level

To avoid recompiling on container recreation, the compose setup mounts persistent cache directories for:

  • torch inductor cache
  • torch extensions cache
  • HuggingFace cache (model files)
  • TensorRT cache

This is the operational difference between a predictable worker restart and a restart that spends time rebuilding the world.

7) Faster config snapshot copying

Config snapshots were previously deep-copied via JSON encode and decode. That is correct but expensive and can coerce types unexpectedly. This was replaced with copy.deepcopy, which is simpler and faster for typical Python dict trees.
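
The change itself is tiny; a sketch of the two approaches:

import copy
import json

config = {"microbatch_max_wait_ms": 10, "microbatch_max_batch_size": 12}

# Before: serialize and re-parse. Correct, but slower, and it coerces
# non-JSON types (tuples become lists, non-string keys become strings).
snapshot_old = json.loads(json.dumps(config))

# After: a direct structural copy of the dict tree.
snapshot_new = copy.deepcopy(config)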

After diagram (what the pipeline looks like now)

RabbitMQ to_embed_text → Handler → InferenceBroker → microbatch window (10 ms) → single-lane model.encode over the batch → ClickHouse inserts → Ack

Microbatching mechanics (sequence)

Handler A → Broker: embed(text A)
Handler B → Broker: embed(text B)
Broker: waits up to 10 ms for more requests
Broker → model.encode([A, B, ...])
model.encode → Broker: vectors
Broker → Handler A: vector(A)
Broker → Handler B: vector(B)

Results (inference-only)

After the optimization set, inference timings are captured in the chart below.

Inference latency after optimization
Min/mean/p95/max; Qwen values scaled vs bge-m3.

The important part is not a single number. It is that the worker finally behaved like it had a GPU: inference became fast and consistent enough that other parts of the pipeline started to dominate.

The next bottleneck: ClickHouse inserts

Once inference became fast, the limiting factor moved to ClickHouse insert latency.

That is not surprising:

  • the to_embed_text path performs multiple writes per message (docs plus embeddings)
  • if inserts are slow, the handler spends time blocked on IO between messages
  • blocking reduces how quickly the next message is processed
  • reduced throughput also reduces how many requests land inside the microbatch window, which reduces batch sizes

At this point, the pipeline is IO-paced. That is still a better place to be than model-paced, because it means the embedding engine is no longer the main limiter.

Practical next steps

If the goal is to drain multi-million message backlogs faster, the next work is mostly around write amplification:

  • reduce round trips (stage docs and embeddings more efficiently; see the batched-insert sketch after this list)
  • increase concurrency carefully while keeping in-flight memory bounded
  • confirm RabbitMQ delivery settings so the broker has enough parallel demand to form real batches
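
As a sketch of the first item, assuming clickhouse-connect: accumulate results across messages and write one insert per table per batch instead of two inserts per message. The buffering and ack coordination are simplified away, and the connection details are placeholders.

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

def flush(doc_rows: list[tuple], emb_rows: list[tuple]) -> None:
    # Two round trips per batch instead of two per message.
    client.insert(
        "lake.docs",
        doc_rows,
        column_names=["source", "source_id", "doc_uuid", "updated_at",
                      "ingest_at", "author", "text", "meta_json"],
    )
    client.insert(
        "lake.embeddings",
        emb_rows,
        column_names=["model", "doc_uuid", "updated_at", "embedding"],
    )
    doc_rows.clear()
    emb_rows.clear()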

Closing

This migration was supposed to be a ClickHouse post about compression and analytics. It still is, but the most useful part ended up being an infrastructure lesson:

At scale, an embedding pipeline is not just a model call. It is a queueing system, a batching system, a compilation system, and a storage system.

Once you treat it like that, the fixes become straightforward.
