
From 4M comments to a style-controlled comment generator

tl;dr summary

I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.

DeBERTaV3 pseudo-labeling, LoRA on SmolLM3, and a dedup pipeline that saved my sanity

I accidentally built the kind of dataset that sounds like a flex and behaves like a responsibility: 4+ million comments, each paired with a username, a short description of the content being commented on, and the comment itself.

The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.

The style set I stabilized on for this project:

  • happy
  • toxic
  • sarcastic
  • cringe
  • wholesome
  • noise (special bucket: excluded from training later)

We will talk about scraping in another post. This one is about the part that actually hurt: cleaning, labeling, scaling labels, and turning it into a controllable generator.

The part nobody posts on Twitter (except me): cleaning the dataset

Before I labeled anything, I had to make sure I wasn’t labeling the same comment 50 times in different spellings.

Real-world scraped comments are full of:

  • exact duplicates (copy paste, reposts, bots)
  • near duplicates (same template, tiny edits)
  • empty rows and whitespace junk
  • “lol”, “ok”, “.”, and other short masterpieces

So I built a dedup and cleanup pass that runs directly on the SQLite database storing the comments.

What the cleaning pass actually did

1) Normalize every comment

Before comparing anything, I normalized text by:

  • collapsing repeated whitespace
  • trimming edges
  • lowercasing

This turns “Nice!!!”, “ nice!!! ” and “NICE!!!” into the same canonical string so the next steps work reliably.
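In code, the canonical form is about three lines (the function name is mine, not the pipeline's exact code):

    import re

    def normalize(text: str) -> str:
        # canonical form used for all later comparisons
        text = text.strip()                  # trim edges
        text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
        return text.lower()                  # lowercase

    # normalize("Nice!!!") == normalize(" nice!!! ") == normalize("NICE!!!") == "nice!!!"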

2) Hard-delete very short comments

Anything with length < 12 characters gets dropped immediately.

This removes a huge chunk of low-signal content: reaction noise, one-word replies, and little fragments that only make sense in-thread (and usually not even then).
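Because the pass runs directly on SQLite, this step is basically one statement. The database path, table, and column names below are placeholders, not my real schema:

    import sqlite3

    con = sqlite3.connect("comments.db")   # placeholder path
    # hypothetical schema: comments(id, username, description, comment)
    con.execute("DELETE FROM comments WHERE LENGTH(TRIM(comment)) < 12")
    con.commit()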

3) Exact dedup (fast path)

After normalization, I computed a fast hash key for each comment (xxHash) and used an in-memory “seen” set.

If the key was already seen, it was an exact duplicate and got deleted.

  • Removed by exact dedup: 17k rows
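The fast path looks roughly like this; rows stands in for a chunked read of (id, normalized comment) pairs from the database:

    import xxhash

    seen = set()
    duplicate_ids = []

    for row_id, text in rows:                          # rows: (id, normalized comment) pairs
        key = xxhash.xxh64(text.encode("utf-8")).intdigest()   # fast 64-bit hash of the canonical string
        if key in seen:
            duplicate_ids.append(row_id)               # exact duplicate -> schedule for deletion
        else:
            seen.add(key)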

4) Near dedup (the important part)

Exact dedup is not enough because the internet loves templates.

So I used a MinHash + LSH approach with character n-grams, and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold (a rough sketch follows the examples below).

This catches things like:

  • same sentence with emojis removed
  • repeated templates with a couple words changed
  • tiny variations that are still basically the same comment

  • Removed by near-dedup: 112k rows
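Here is the shape of that pass using the datasketch library. The threshold and n-gram size are illustrative, not the exact values I used:

    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128
    lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM)     # "high threshold" on approximate Jaccard

    def minhash(text: str, n: int = 5) -> MinHash:
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(text) - n + 1, 1)):
            m.update(text[i:i + n].encode("utf-8"))        # character n-grams as the shingle set
        return m

    near_duplicate_ids = []
    for row_id, text in rows:                              # rows: (id, normalized comment) pairs
        m = minhash(text)
        if lsh.query(m):                                   # a near-identical comment is already indexed
            near_duplicate_ids.append(row_id)
        else:
            lsh.insert(row_id, m)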

5) Chunked processing + bulk deletes

Everything ran in chunks (batch reads, buffered deletes) so it could process millions of rows without turning my machine into a smoke test.
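A stripped-down version of the driver loop, with should_delete standing in as a hypothetical wrapper around the hash/LSH checks above:

    import sqlite3

    CHUNK = 50_000
    con = sqlite3.connect("comments.db")                   # placeholder path

    delete_buffer = []
    last_id = 0
    while True:
        rows = con.execute(
            "SELECT id, comment FROM comments WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, CHUNK),
        ).fetchall()                                       # batch read
        if not rows:
            break
        last_id = rows[-1][0]

        for row_id, text in rows:
            if should_delete(row_id, normalize(text)):     # should_delete: hypothetical, combines the checks above
                delete_buffer.append((row_id,))

        if len(delete_buffer) >= CHUNK:                    # buffered bulk deletes
            con.executemany("DELETE FROM comments WHERE id = ?", delete_buffer)
            con.commit()
            delete_buffer.clear()

    con.executemany("DELETE FROM comments WHERE id = ?", delete_buffer)   # flush the remainder
    con.commit()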

What changed after cleaning

After cleaning, the dataset landed at:

  • Final row count: 3.7M comments
  • Average comment length: 44 characters

I also tracked distribution changes (length, emoji usage, etc.) as a sanity check. The point was not to sterilize the dataset, just to remove duplication and low-value sludge so the modeling steps would learn signal instead of repetition.

Labels: pick styles you can actually label

I kept the taxonomy intentionally small.

If you cannot define a label in one sentence, you do not have a label, you have a future argument with yourself.

I treated noise as a first-class concept: spam, contextless fragments, unreadable junk, and “I refuse to teach this to a model” all go there. Later, noise is excluded from generator training.

Multi-class only. One label per comment.

Step 1: hand-label ~1k comments

I labeled about 1,000 comments by hand.

This was just enough to train the first classifier. Not great, not terrible, and definitely enough to bootstrap the next step.

Step 2: fine-tune DeBERTaV3 base

I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.

No fancy tricks. No special losses. The goal was not perfection, it was usefulness.
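For reference, the setup is the standard Hugging Face sequence-classification recipe. The dataset names, batch size, and epoch count below are placeholders:

    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]

    tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=len(LABELS)
    )

    def encode(batch):
        return tok(batch["comment"], truncation=True, max_length=128)

    # train_ds / val_ds: labeled Hugging Face Datasets with a "comment" string column
    # and an integer "label" column (placeholder names)
    train_ds = train_ds.map(encode, batched=True)
    val_ds = val_ds.map(encode, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="deberta-style",
                               per_device_train_batch_size=32,
                               num_train_epochs=3),
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=DataCollatorWithPadding(tok),
    )
    trainer.train()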

About the 85.22% accuracy number

I measured 85.22% accuracy on a validation set that was ~20% randomly sampled from the labeled data.

I mention it, but I do not treat it as a strong scientific result. Random splits can be optimistic if you have near-duplicates, repeated templates, or topical clustering that leaks across train and validation. It was mainly a “does this classifier work at all?” signal while iterating.

Step 3: model-assisted labeling (pseudo-labeling with a human in the loop)

This is where the project stops being “tiny dataset” and becomes “pipeline”.

Loop:

  1. sample a comment
  2. DeBERTa predicts a style
  3. I accept or correct
  4. retrain periodically
  5. repeat
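The whole loop fits in a few lines. predict_style reuses the classifier from above; sample_unlabeled is a hypothetical helper that pulls random unlabeled rows:

    import torch

    def predict_style(text: str):
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(-1)[0]
        idx = int(probs.argmax())
        return LABELS[idx], float(probs[idx])

    labeled = []
    for text in sample_unlabeled(n=200):                   # sample_unlabeled: hypothetical helper
        guess, conf = predict_style(text)
        print(f"[{conf:.2f}] {guess}: {text}")
        answer = input("enter = accept, or type the correct label: ").strip()
        labeled.append((text, answer or guess))
    # every few hundred corrections: fold `labeled` back into the training set and retrain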

I used a few strategies:

  • single random samples with accept/correct
  • bulk review batches
  • once I had ~4k labels, I also generated comments in a target style, reviewed them, and reclassified bad generations into noise

That last part was surprisingly useful: it exposed failure modes early and made noise a practical quality gate, not an afterthought.

This got me to about 8k labeled comments.

I also tried using a bigger LLM for labeling (gpt-oss-safeguard-20b), but it was too expensive compute-wise and not efficient enough for this workflow. The classifier plus human correction was simpler and faster.

Step 4: auto-label at scale, then filter hard

Once the classifier stabilized, I ran it over the cleaned corpus and kept only high-confidence labels:

  • keep predictions with >= 70% probability
  • if noise had > 40% confidence, drop the comment entirely
  • exclude noise from the final training dataset
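Those three rules, written as a filter over the classifier's softmax output (a minimal sketch; probs is one comment's label-to-probability mapping):

    KEEP_THRESHOLD = 0.70        # keep predictions with >= 70% probability
    NOISE_DROP_THRESHOLD = 0.40  # if noise is > 40%, drop the comment entirely

    def final_label(probs):
        if probs["noise"] > NOISE_DROP_THRESHOLD:
            return None                                   # too likely to be junk
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if label != "noise" and p >= KEEP_THRESHOLD:
            return label                                  # confident, non-noise label
        return None                                       # low confidence or noise -> excluded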

That produced:

  • ~1.8M confidently labeled comments (minus noise)

This trade matters. It is better to train on fewer samples with strong labels than on millions of weak guesses that blur style boundaries.

Step 5: train the generator (SmolLM3 + LoRA, SFT-only)

With ~1.8M filtered training samples, I trained a generator:

  • Base model: SmolLM3 (3B)
  • Training: SFT-only
  • Fine-tune method: LoRA
  • Hardware: RTX 4090
  • Stack: Unsloth, latest CUDA, TensorRT for inference

This was not a “replace your assistant” model. It was a “produce believable comments in a requested style” model, and LoRA was the right tool for that job.
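The actual run went through Unsloth on the 4090, but the LoRA setup itself looks like the plain peft version below. Rank, alpha, dropout, and target modules are placeholders, and the base checkpoint id is my best guess at the hub name:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    BASE = "HuggingFaceTB/SmolLM3-3B"                      # assumption: check the exact repo id

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16", device_map="auto")

    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,            # placeholder hyperparameters
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()                     # sanity check: only a tiny fraction of the 3B params train

    # from here: standard SFT over the chat-formatted samples (next section)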

Prompt format

Training data was structured like chat:

  • System prompt: style + a few guidelines
  • User message:
<username>...</username><description>...</description>
  • Assistant output: the comment text

No transformations. No complicated post-processing. Just clean conditioning and direct generation.
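One training sample, assembled as chat messages (the system prompt wording here is paraphrased, not my exact guidelines):

    def to_messages(style, username, description, comment):
        system = f"You write comments in a {style} style. Keep it short and natural."
        user = f"<username>{username}</username><description>{description}</description>"
        return [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": comment},
        ]

    # rendered into a single training string with the model's own chat template:
    # text = tokenizer.apply_chat_template(to_messages(...), tokenize=False)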

Evaluation: could I spot my own model?

I ran an arena-style human eval:

  • “Real” comments were drawn from the same distribution as the prompts, so it was not an unfair mismatch.
  • I was the only rater.
  • Each trial: two candidate comments, I try to identify which one is model-generated.
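The harness is intentionally dumb: draw a real/generated pair, blind the order, record my pick. real_comments and generated_comments are placeholder lists:

    import random

    N_TRIALS = 500
    correct = 0
    for _ in range(N_TRIALS):
        pair = [("real", random.choice(real_comments)),
                ("model", random.choice(generated_comments))]
        random.shuffle(pair)                               # blind the order
        for i, (_, text) in enumerate(pair):
            print(f"{i}: {text}")
        pick = int(input("which one is the model? "))
        correct += pair[pick][0] == "model"
    print(f"identified the model {correct / N_TRIALS:.0%} of the time")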

Results:

  • latest evaluation: n = 500
  • I correctly identified the model about 57% of the time
  • earlier quick checks were n = 100

57% is close enough to random guessing to be a meaningful win for this use case. It is not invisibility, but it is no longer “obvious machine text”.

Per-label F1 scores

I evaluated the style classifier on a held-out validation split (useful for iteration, not a publication-grade benchmark). Per-label F1:

  • happy: F1 = 0.84
  • toxic: F1 = 0.82
  • sarcastic: F1 = 0.80
  • cringe: F1 = 0.81
  • wholesome: F1 = 0.84
  • noise: tracked separately, excluded from generator training

These scores were strong enough to bootstrap pseudo-labeling, but I still filtered aggressively by confidence to keep style boundaries crisp.

What actually made this work

  1. Cleaning first. Dedup + short-comment removal prevented the pipeline from learning repetition and noise.
  2. Pseudo-labeling as a multiplier. 1k labels became 8k quickly once the classifier could assist.
  3. Confidence filtering. 4M raw comments are impressive. 1.8M high-confidence labels are trainable.
  4. A strict noise bucket. Excluding garbage is the difference between “style control” and “style soup”.
  5. LoRA on a small model. Practical, cheap, and good enough to be hard to spot.