
Mogador: enterprise-grade scraping with one beefy workstation

tl;dr

Mogador is a Rust/Python scraping pipeline running on formidable: RabbitMQ queues, Rust workers, MinIO, Elasticsearch, ClickHouse, GPU tagging, embeddings, dedupe, Kafka logging, a friendly osu! ingestion path for ozut, and a compact extension model for new data formats.

Mogador is my media and text ingestion system. It runs locally, on a workstation named formidable, and has grown into the data substrate behind several ML projects. At the current planning snapshot it has processed more than 13 million files, keeps about 9.18 million indexed documents in Elasticsearch, and stores 11.77 million embedded text rows in ClickHouse.

The hardware is a Ryzen 9 7950X with 61 GiB RAM, an RTX 4090, fast SSD working storage, and a RAID1 media tier made from 8 TB disks. The RAID1 tier exposes about 7.28 TiB usable, with roughly 6.47 TiB currently used. The GPU is shared across visual tagging, embeddings, thumbnail/transcode work, and local model experiments, so the software has to be polite about concurrency. Letting every worker sprint into CUDA at once is a good way to create a very expensive traffic jam.

What the pipeline does

The main flow is direct:

  1. Scrapers discover records.
  2. PostgreSQL stores claims, checkpoints, scraper state, and idempotency keys.
  3. RabbitMQ dispatches download work.
  4. downloader-rs streams media into MinIO staging.
  5. optimizer-rs normalizes media and performs deduplication.
  6. Tagger, embedder, thumbnailer, and indexer workers enrich the result.
  7. Elasticsearch serves search, filtering, ranking, and vector retrieval.
  8. ClickHouse stores text rows, embeddings, and analytical history.
  9. Kafka receives structured service events for audit, replay, and debugging.

The important design choice is that source adapters stop at discovery. Tagging, embeddings, media normalization, thumbnailing, dedupe, and indexing stay in shared workers. A source adapter emits the shared message shape, then the rest of the pipeline does the expensive work. That is why adding a new source is usually a local change rather than a new platform.
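
A minimal sketch of that shared shape, with illustrative field names rather than the real schema (Python shown here, though the adapters are mixed Rust/Python):

    import json
    import time
    import uuid

    def make_download_task(source: str, url: str, external_id: str) -> bytes:
        """Build the message a source adapter publishes to the download queue."""
        task = {
            "task_id": str(uuid.uuid4()),       # idempotency key, mirrored in PostgreSQL
            "source": source,                   # source adapter name
            "external_id": external_id,         # source-native identifier
            "url": url,                         # what downloader-rs should fetch
            "discovered_at": int(time.time()),
            "attempt": 0,                       # incremented on retry
        }
        return json.dumps(task).encode("utf-8")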

This also makes backpressure survivable. Scrapers can keep discovering while GPU work catches up later. The optimizer can build a queue without losing scraper checkpoints. Elasticsearch can reject a batch and the indexer can retry. ClickHouse can receive inserts in batches sized for analytical storage. The system has pressure points, but they are visible pressure points rather than mystery puddles under the sink.

Storage layout

Mogador uses separate stores because each one answers a different question:

Layer           Role
RabbitMQ        task queues
Kafka           structured event log
PostgreSQL      scraper state and idempotency
MinIO           staged raw/intermediate objects
RocksDB         optimizer dedupe state
Elasticsearch   online search and retrieval
ClickHouse      analytics, text rows, embeddings
filesystem      sharded final media and thumbnails

RabbitMQ decides what should happen next. Kafka records what happened. PostgreSQL answers whether a scraper item has been seen before. RocksDB answers whether media is duplicate or already being processed. Elasticsearch answers interactive search questions. ClickHouse answers analytical questions over millions of rows.

Final files use hash fan-out / Git-style directory sharding. A UUID or content-derived key is split into prefix directories, for example e9/c8/..., so no single directory becomes a landfill of millions of entries. This keeps filesystem tools responsive when the corpus gets large enough to bully naive layouts.
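
A minimal sketch of the fan-out, assuming two levels of two-character hex prefixes:

    from pathlib import Path

    def shard_path(root: Path, hex_key: str, levels: int = 2, width: int = 2) -> Path:
        """Split a hex key into prefix directories: e9c8ab... -> root/e9/c8/e9c8ab..."""
        prefixes = [hex_key[i * width:(i + 1) * width] for i in range(levels)]
        return root.joinpath(*prefixes, hex_key)

    # shard_path(Path("/srv/media"), "e9c8ab12...") -> /srv/media/e9/c8/e9c8ab12...

Two hex characters per level caps each directory at 256 children, which stays comfortable for filesystem tools at any corpus size.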

Downloading and optimization

The downloader streams bytes instead of loading whole objects into memory. It applies max-size limits, blocks unwanted extensions, handles per-source cookies where configured, writes the staged object, then publishes an optimization message.
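
downloader-rs is Rust, but the streaming shape translates directly; a Python sketch with an assumed size cap:

    import requests

    MAX_BYTES = 512 * 1024 * 1024  # illustrative per-object cap, not the real limit

    def stream_to_staging(url: str, dest: str, cookies: dict | None = None) -> int:
        """Stream a download to a staging path, aborting once past the size cap."""
        written = 0
        with requests.get(url, stream=True, timeout=30, cookies=cookies) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    written += len(chunk)
                    if written > MAX_BYTES:
                        raise RuntimeError("max_size_exceeded")  # becomes a dead-letter reason
                    f.write(chunk)
        return written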

The optimizer is the worker that turns raw staged artifacts into durable normalized outputs. For images it computes BLAKE3 and perceptual hashes, then uses libvips for normalization. For videos and audio it samples frames, computes video perceptual hashes, and uses ffmpeg for transcode work, including AV1-oriented encodes where the storage and quality tradeoff makes sense.
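
The optimizer itself is Rust with libvips, but the two fingerprint roles are easy to show with the Python blake3 and ImageHash packages:

    import blake3                    # exact, byte-level identity
    import imagehash                 # perceptual, survives re-encodes and resizes
    from PIL import Image

    def fingerprint_image(path: str) -> tuple[str, str]:
        """Exact hash catches identical bytes; perceptual hash catches lookalikes."""
        with open(path, "rb") as f:
            exact = blake3.blake3(f.read()).hexdigest()
        perceptual = str(imagehash.phash(Image.open(path)))
        return exact, perceptual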

Deduplication uses a RocksDB-backed state machine. Exact hashes catch byte-identical files. Perceptual hashes catch visually similar images and sampled video frames. The state machine matters because dedupe and publishing have to move together: a worker cannot claim a duplicate, publish half the downstream messages, crash, and leave the system guessing. Items can be pending, ready to publish, publishing, published, or aborted.
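
A sketch of the transition rules, with store standing in for the RocksDB handle and the state names taken from the list above:

    from enum import Enum

    class DedupeState(str, Enum):
        PENDING = "pending"
        READY = "ready_to_publish"
        PUBLISHING = "publishing"
        PUBLISHED = "published"
        ABORTED = "aborted"

    # Legal transitions; anything else is a bug the state store surfaces loudly.
    TRANSITIONS = {
        DedupeState.PENDING: {DedupeState.READY, DedupeState.ABORTED},
        DedupeState.READY: {DedupeState.PUBLISHING, DedupeState.ABORTED},
        DedupeState.PUBLISHING: {DedupeState.PUBLISHED, DedupeState.ABORTED},
    }

    def advance(store, key: bytes, new: DedupeState) -> None:
        """Persist one transition before acting on it."""
        raw = store.get(key)
        current = DedupeState(raw.decode()) if raw else DedupeState.PENDING
        if new not in TRANSITIONS.get(current, set()):
            raise RuntimeError(f"illegal transition: {current.value} -> {new.value}")
        store.put(key, new.value.encode())

Because every transition is persisted before the corresponding publish, a restart can resume from a PUBLISHING row instead of guessing how far the worker got.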

At this scale, dedupe is both correctness and cost control. If duplicate media keeps flowing downstream, the system wastes CPU on transcodes, GPU time on tags and embeddings, disk space on normalized artifacts, and index space on redundant search documents. Catching it before long-term storage keeps later stages focused on canonical artifacts.

Search, analytics, and ML

Elasticsearch currently holds the searchable media documents. The planning snapshot shows 9.18M documents and 330.7 GB of primary store. The index includes metadata, tags, dense vectors, and fields needed for retrieval and ranking experiments. The indexer prepares documents concurrently and flushes through the Elasticsearch bulk API so throughput stays predictable.
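
A sketch of the flush with the official Python client; the index name and chunk size are illustrative:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import streaming_bulk

    es = Elasticsearch("http://localhost:9200")

    def flush_documents(docs):
        """Push prepared documents through the bulk API, catching per-doc rejections."""
        actions = ({"_index": "media", "_id": d["id"], "_source": d} for d in docs)
        for ok, item in streaming_bulk(es, actions, chunk_size=500, raise_on_error=False):
            if not ok:
                handle_rejection(item)  # hypothetical hook: requeue or dead-letter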

ClickHouse is the analytical side. It stores text/comment rows in lake.docs and BGE-M3 vectors in lake.embeddings. At the planning snapshot, the embedding table holds 11.77M rows: a 44.89 GiB raw vector footprint compressed to 34.20 GiB on disk.
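
A sketch of the batched insert with clickhouse-connect; the column names are illustrative, not the real lake.embeddings schema:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    def insert_embeddings(rows):
        """Insert one analytically-sized batch; rows are (doc_id, text, vector) tuples."""
        client.insert(
            "lake.embeddings",
            rows,
            column_names=["doc_id", "text", "vector"],  # illustrative schema
        )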

The visual tagger uses SmilingWolf/wd-eva02-large-tagger-v3 through ONNXRuntime GPU. It turns images and video frames into ranked semantic tags, characters, style descriptors, and safety-ish categories. Those tags power search facets, dataset filtering, training manifest exports, and later ML experiments.

The embedder uses BAAI/bge-m3 through FlagEmbedding. It consumes two streams: to_embed, for media-derived search documents, and to_embed_text, for text/comment rows going into ClickHouse. Both GPU services use short microbatch windows. A few milliseconds of waiting can turn scattered single-message calls into usable batches, which is how a 4090 stays busy instead of sighing at batch size 1 forever.
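
The window logic is small; a sketch with an in-process queue and illustrative limits:

    import queue
    import time

    def collect_microbatch(q: queue.Queue, max_items: int = 32, window_ms: int = 5):
        """Block for the first item, then drain for a few ms to build a real batch."""
        batch = [q.get()]
        deadline = time.monotonic() + window_ms / 1000
        while len(batch) < max_items:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        return batch

A few milliseconds of added latency is invisible at the queue level but turns batch size 1 into batch size 32 on the GPU.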

The useful part is that data exists in multiple shapes at once. ClickHouse can scan text and metadata. Elasticsearch can serve filters and retrieval. Dense vectors support clustering, semantic search, style grouping, and reranking experiments. The same stable item can participate in analytics, search, model training, and dataset export without rebuilding the whole corpus each time.

Kafka logging and retries

Kafka is the audit spine. Services emit structured events such as download_succeeded, duplicate_detected, tag_batch_completed, embedding_batch_inserted, index_bulk_flushed, and dead_lettered. Each event carries fields like event_id, trace_id, service name, queue, attempt, duration, media kind, source class, and timestamp.
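
A sketch of the emit path with kafka-python; the topic name and field set are illustrative:

    import json
    import time
    import uuid
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def emit_event(event: str, service: str, **fields) -> None:
        """Publish one structured service event to the audit topic."""
        producer.send("mogador.events", {      # illustrative topic name
            "event_id": str(uuid.uuid4()),
            "event": event,                    # e.g. "download_succeeded"
            "service": service,
            "ts_ms": int(time.time() * 1000),
            **fields,                          # trace_id, queue, attempt, duration_ms, ...
        })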

That gives the system a replayable operational history. If an item disappeared, duplicated, failed tagging, or got indexed with stale metadata, the timeline can be reconstructed without asking Elasticsearch or ClickHouse to become the event log.

Retries are explicit and counted. Dead-letter queues carry reason labels such as http_404, unsupported_media, max_retries_exceeded, pending_duplicate, clickhouse_error, db_error, model_error, and invalid_payload. A dead-letter message should be something you can triage, repair, and requeue; otherwise it becomes an inbox for ghosts.

The operational benefit is that problems stay classified. If ClickHouse insert latency grows, that is a ClickHouse/write-path problem. If ffmpeg latency grows, that is a CPU/transcode problem. If GPU batches collapse to size 1, that is an inference scheduling problem. The system still requires attention, but it usually tells me which corner is smoking.

osu! and ozut

The osu! path supports ozut, my osu! beatmap generation project. A metadata sync keeps beatmapset and beatmap state locally, then beatmap archives enter Mogador through the same to_download queue as other artifacts.

The beatmap transformer extracts .osz archives, converts the primary audio to Opus 128 kbps, patches each .osu file so AudioFilename: points at the normalized Opus track, and removes unrelated assets such as custom hitsounds, storyboards, videos, and backgrounds. The cleaned corpus keeps the useful structure: timing, hit objects, difficulty metadata, and audio. The current ozut training corpus is over 200 GB across 301,374 beatmaps.
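
The AudioFilename patch is the simplest piece to show; a sketch that assumes the normalized track is named audio.opus:

    from pathlib import Path

    def patch_audio_filename(osu_path: Path, new_audio: str = "audio.opus") -> None:
        """Rewrite the [General] AudioFilename line to point at the Opus track."""
        lines = osu_path.read_text(encoding="utf-8-sig").splitlines()
        patched = [
            f"AudioFilename: {new_audio}"
            if line.strip().startswith("AudioFilename:")
            else line
            for line in lines
        ]
        osu_path.write_text("\n".join(patched) + "\n", encoding="utf-8")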

That path matters because rhythm-game data has a different shape from ordinary media. The model needs synchronized timing and audio far more than decorative archive contents. Treating beatmaps as first-class transformable artifacts lets the same Mogador machinery prepare them without building a separate ingestion island.

What it enabled

Mogador has already powered several downstream projects:

  • comment style classification with a DeBERTa-based classifier
  • topic modeling over comment embeddings
  • LLM fine-tuning with Unsloth
  • SDXL / SD3 LoRA dataset construction
  • ozut training on cleaned beatmap archives
  • recommender and reranking experiments

The common theme is that local data becomes easier to reuse once it has stable IDs, dedupe state, tags, embeddings, and searchable metadata. Dataset construction becomes a query and export problem instead of a folder archaeology problem.

For comment projects, the rows can be scanned analytically and retrieved semantically. For image projects, tags and dedupe state make LoRA dataset selection more controlled. For ozut, cleaned beatmaps provide structure without dragging unrelated assets through every experiment. For recommendations, Elasticsearch and ClickHouse provide different views of the same corpus.

AWS estimate

A rough AWS version gets expensive quickly. With a 9.46 TiB object corpus, 13M processed files, SQS-style queue hops, OpenSearch, managed ClickHouse, a CPU instance, a 24 GB GPU instance, and gp3 storage for RAID1-like durability, the baseline estimate is roughly:

Scenario       Estimate
no egress      ~$4,300-$6,300/mo
1 TB egress    ~$4,400-$6,400/mo
5 TB egress    ~$4,800-$6,800/mo

MediaConvert is the large variable. A 1.3M-video processing pass can add ~$4,875-$78,000 depending on duration and encoding assumptions. Running the system locally means buying and maintaining hardware, but the recurring bill is much easier to understand.
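
The width of that range is just duration arithmetic; assuming a flat per-output-minute rate, the endpoints fall out like this:

    # Back-of-envelope only; the $0.015/min rate and durations are assumptions.
    videos = 1_300_000
    rate_per_min = 0.015
    low = videos * 0.25 * rate_per_min   # ~15 s average output  -> $4,875
    high = videos * 4.0 * rate_per_min   # ~4 min average output -> $78,000
    print(f"${low:,.0f} - ${high:,.0f}")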

Why it works

Mogador works because boundaries are explicit. Scrapers discover. Queues dispatch. Workers claim, retry, and publish. Dedupe state is durable. GPU work is batched. Search and analytics live in separate stores. Every expensive stage can fall behind without corrupting the rest of the pipeline.

That is the main lesson from running it locally: the system works because every component has enough durable state around it to make failures inspectable. There is usually a queue, a claim, an event, a metric, or a dead-letter reason pointing at the next action.
