
From 4M comments to a style-controlled comment generator

tl;dr summary

I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.

DeBERTaV3 pseudo-labeling, LoRA on SmolLM3, and a dedup pipeline that saved my sanity

I accidentally built the kind of dataset that sounds like a flex and behaves like a responsibility: 4+ million comments, each paired with a username, a short description of the content being commented on, and the comment itself.

The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.

The style set I stabilized on for this project:

  • happy
  • toxic
  • sarcastic
  • cringe
  • wholesome
  • noise (special bucket: excluded from training later)

We will talk about scraping in another post. This one is about the part that actually hurt: cleaning, labeling, scaling labels, and turning it into a controllable generator.

The part nobody posts on Twitter (except me): cleaning the dataset

Before I labeled anything, I had to make sure I wasn’t labeling the same comment 50 times in different spellings.

Real-world scraped comments are full of:

  • exact duplicates (copy paste, reposts, bots)
  • near duplicates (same template, tiny edits)
  • empty rows and whitespace junk
  • “lol”, “ok”, “.”, and other short masterpieces

So I built a dedup and cleanup pass that runs directly on the SQLite database storing the comments.

What the cleaning pass actually did

1) Normalize every comment

Before comparing anything, I normalized text by:

  • collapsing repeated whitespace
  • trimming edges
  • lowercasing

This turns “Nice!!!”, “ nice!!! ” and “NICE!!!” into the same canonical string so the next steps work reliably.
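In code, the canonical form is about three lines (the function name is mine, not the pipeline's exact code):

    import re

    def normalize(text: str) -> str:
        # canonical form used for all later comparisons
        text = text.strip()                  # trim edges
        text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
        return text.lower()                  # lowercase

    # normalize("Nice!!!") == normalize(" nice!!! ") == normalize("NICE!!!") == "nice!!!"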

2) Hard-delete very short comments

Anything with length < 12 characters gets dropped immediately.

This removes a huge chunk of low-signal content: reaction noise, one-word replies, and little fragments that only make sense in-thread (and usually not even then).
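Because the pass runs directly on SQLite, this step is basically one statement. The database path, table, and column names below are placeholders, not my real schema:

    import sqlite3

    con = sqlite3.connect("comments.db")   # placeholder path
    # hypothetical schema: comments(id, username, description, comment)
    con.execute("DELETE FROM comments WHERE LENGTH(TRIM(comment)) < 12")
    con.commit()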

3) Exact dedup (fast path)

After normalization, I computed a fast hash key for each comment (xxHash) and used an in-memory “seen” set.

If the key was already seen, it was an exact duplicate and got deleted.

  • Removed by exact dedup: 17k rows
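The fast path looks roughly like this; rows stands in for a chunked read of (id, normalized comment) pairs from the database:

    import xxhash

    seen = set()
    duplicate_ids = []

    for row_id, text in rows:                          # rows: (id, normalized comment) pairs
        key = xxhash.xxh64(text.encode("utf-8")).intdigest()   # fast 64-bit hash of the canonical string
        if key in seen:
            duplicate_ids.append(row_id)               # exact duplicate -> schedule for deletion
        else:
            seen.add(key)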

4) Near dedup (the important part)

Exact dedup is not enough because the internet loves templates.

So I used a MinHash + LSH approach with character n-grams, and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold (a rough sketch follows the examples below).

This catches things like:

  • same sentence with emojis removed
  • repeated templates with a couple words changed
  • tiny variations that are still basically the same comment

  • Removed by near-dedup: 112k rows
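Here is the shape of that pass using the datasketch library. The threshold and n-gram size are illustrative, not the exact values I used:

    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128
    lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM)     # "high threshold" on approximate Jaccard

    def minhash(text: str, n: int = 5) -> MinHash:
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(text) - n + 1, 1)):
            m.update(text[i:i + n].encode("utf-8"))        # character n-grams as the shingle set
        return m

    near_duplicate_ids = []
    for row_id, text in rows:                              # rows: (id, normalized comment) pairs
        m = minhash(text)
        if lsh.query(m):                                   # a near-identical comment is already indexed
            near_duplicate_ids.append(row_id)
        else:
            lsh.insert(row_id, m)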

5) Chunked processing + bulk deletes

Everything ran in chunks (batch reads, buffered deletes) so it could process millions of rows without turning my machine into a smoke test.
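A stripped-down version of the driver loop, with should_delete standing in as a hypothetical wrapper around the hash/LSH checks above:

    import sqlite3

    CHUNK = 50_000
    con = sqlite3.connect("comments.db")                   # placeholder path

    delete_buffer = []
    last_id = 0
    while True:
        rows = con.execute(
            "SELECT id, comment FROM comments WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, CHUNK),
        ).fetchall()                                       # batch read
        if not rows:
            break
        last_id = rows[-1][0]

        for row_id, text in rows:
            if should_delete(row_id, normalize(text)):     # should_delete: hypothetical, combines the checks above
                delete_buffer.append((row_id,))

        if len(delete_buffer) >= CHUNK:                    # buffered bulk deletes
            con.executemany("DELETE FROM comments WHERE id = ?", delete_buffer)
            con.commit()
            delete_buffer.clear()

    con.executemany("DELETE FROM comments WHERE id = ?", delete_buffer)   # flush the remainder
    con.commit()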

What changed after cleaning

After cleaning, the dataset landed at:

  • Final row count: 3.7M comments
  • Average comment length: 44 characters

I also tracked distribution changes (length, emoji usage, etc.) as a sanity check. The point was not to sterilize the dataset, just to remove duplication and low-value sludge so the modeling steps would learn signal instead of repetition.

Labels: pick styles you can actually label

I kept the taxonomy intentionally small.

If you cannot define a label in one sentence, you do not have a label, you have a future argument with yourself.

I treated noise as a first-class concept: spam, contextless fragments, unreadable junk, and “I refuse to teach this to a model” all go there. Later, noise is excluded from generator training.

Multi-class only. One label per comment.

Step 1: hand-label ~1k comments

I labeled about 1,000 comments by hand.

This was just enough to train the first classifier. Not great, not terrible, and definitely enough to bootstrap the next step.

Step 2: fine-tune DeBERTaV3 base

I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.

No fancy tricks. No special losses. The goal was not perfection, it was usefulness.
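For reference, the setup is the standard Hugging Face sequence-classification recipe. The dataset names, batch size, and epoch count below are placeholders:

    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]

    tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=len(LABELS)
    )

    def encode(batch):
        return tok(batch["comment"], truncation=True, max_length=128)

    # train_ds / val_ds: labeled Hugging Face Datasets with a "comment" string column
    # and an integer "label" column (placeholder names)
    train_ds = train_ds.map(encode, batched=True)
    val_ds = val_ds.map(encode, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="deberta-style",
                               per_device_train_batch_size=32,
                               num_train_epochs=3),
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=DataCollatorWithPadding(tok),
    )
    trainer.train()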

About the 85.22% accuracy number

I measured 85.22% accuracy on a validation set that was ~20% randomly sampled from the labeled data.

I mention it, but I do not treat it as a strong scientific result. Random splits can be optimistic if you have near-duplicates, repeated templates, or topical clustering that leaks across train and validation. It was mainly a “does this classifier work at all?” signal while iterating.

Step 3: model-assisted labeling (pseudo-labeling with a human in the loop)

This is where the project stops being “tiny dataset” and becomes “pipeline”.

Loop:

  1. sample a comment
  2. DeBERTa predicts a style
  3. I accept or correct
  4. retrain periodically
  5. repeat
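The whole loop fits in a few lines. predict_style reuses the classifier from above; sample_unlabeled is a hypothetical helper that pulls random unlabeled rows:

    import torch

    def predict_style(text: str):
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(-1)[0]
        idx = int(probs.argmax())
        return LABELS[idx], float(probs[idx])

    labeled = []
    for text in sample_unlabeled(n=200):                   # sample_unlabeled: hypothetical helper
        guess, conf = predict_style(text)
        print(f"[{conf:.2f}] {guess}: {text}")
        answer = input("enter = accept, or type the correct label: ").strip()
        labeled.append((text, answer or guess))
    # every few hundred corrections: fold `labeled` back into the training set and retrain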

I used a few strategies:

  • single random samples with accept/correct
  • bulk review batches
  • once I had ~4k labels, I also generated comments in a target style, reviewed them, and reclassified bad generations into noise

That last part was surprisingly useful: it exposed failure modes early and made noise a practical quality gate, not an afterthought.

This got me to about 8k labeled comments.

I also tried using a bigger LLM for labeling (gpt-oss-safeguard-20b), but it was too expensive compute-wise and not efficient enough for this workflow. The classifier plus human correction was simpler and faster.

Step 4: auto-label at scale, then filter hard

Once the classifier stabilized, I ran it over the cleaned corpus and kept only high-confidence labels:

  • keep predictions with >= 70% probability
  • if noise had > 40% confidence, drop the comment entirely
  • exclude noise from the final training dataset
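Those three rules, written as a filter over the classifier's softmax output (a minimal sketch; probs is one comment's label-to-probability mapping):

    KEEP_THRESHOLD = 0.70        # keep predictions with >= 70% probability
    NOISE_DROP_THRESHOLD = 0.40  # if noise is > 40%, drop the comment entirely

    def final_label(probs):
        if probs["noise"] > NOISE_DROP_THRESHOLD:
            return None                                   # too likely to be junk
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if label != "noise" and p >= KEEP_THRESHOLD:
            return label                                  # confident, non-noise label
        return None                                       # low confidence or noise -> excluded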

That produced:

  • ~1.8M confidently labeled comments (minus noise)

This trade matters. It is better to train on fewer samples with strong labels than on millions of weak guesses that blur style boundaries.

Step 5: train the generator (SmolLM3 + LoRA, SFT-only)

With ~1.8M filtered training samples, I trained a generator:

  • Base model: SmolLM3 (3B)
  • Training: SFT-only
  • Fine-tune method: LoRA
  • Hardware: RTX 4090
  • Stack: Unsloth, latest CUDA, TensorRT for inference

This was not a “replace your assistant” model. It was a “produce believable comments in a requested style” model, and LoRA was the right tool for that job.
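The actual run went through Unsloth on the 4090, but the LoRA setup itself looks like the plain peft version below. Rank, alpha, dropout, and target modules are placeholders, and the base checkpoint id is my best guess at the hub name:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    BASE = "HuggingFaceTB/SmolLM3-3B"                      # assumption: check the exact repo id

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16", device_map="auto")

    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,            # placeholder hyperparameters
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()                     # sanity check: only a tiny fraction of the 3B params train

    # from here: standard SFT over the chat-formatted samples (next section)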

Prompt format

Training data was structured like chat:

  • System prompt: style + a few guidelines
  • User message:
<username>...</username><description>...</description>
  • Assistant output: the comment text

No transformations. No complicated post-processing. Just clean conditioning and direct generation.
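One training sample, assembled as chat messages (the system prompt wording here is paraphrased, not my exact guidelines):

    def to_messages(style, username, description, comment):
        system = f"You write comments in a {style} style. Keep it short and natural."
        user = f"<username>{username}</username><description>{description}</description>"
        return [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": comment},
        ]

    # rendered into a single training string with the model's own chat template:
    # text = tokenizer.apply_chat_template(to_messages(...), tokenize=False)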

Evaluation: could I spot my own model?

I ran an arena-style human eval:

  • “Real” comments were drawn from the same distribution as the prompts, so it was not an unfair mismatch.
  • I was the only rater.
  • Each trial: two candidate comments, I try to identify which one is model-generated.
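The harness is intentionally dumb: draw a real/generated pair, blind the order, record my pick. real_comments and generated_comments are placeholder lists:

    import random

    N_TRIALS = 500
    correct = 0
    for _ in range(N_TRIALS):
        pair = [("real", random.choice(real_comments)),
                ("model", random.choice(generated_comments))]
        random.shuffle(pair)                               # blind the order
        for i, (_, text) in enumerate(pair):
            print(f"{i}: {text}")
        pick = int(input("which one is the model? "))
        correct += pair[pick][0] == "model"
    print(f"identified the model {correct / N_TRIALS:.0%} of the time")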

Results:

  • latest evaluation: n = 500
  • I correctly identified the model about 57% of the time
  • earlier quick checks were n = 100

57% is close enough to random guessing to be a meaningful win for this use case. It is not invisibility, but it is no longer “obvious machine text”.

Per-label F1 scores

I evaluated the style classifier on a held-out validation split (useful for iteration, not a publication-grade benchmark). Per-label F1:

  • happy: F1 = 0.84
  • toxic: F1 = 0.82
  • sarcastic: F1 = 0.80
  • cringe: F1 = 0.81
  • wholesome: F1 = 0.84
  • noise: tracked separately, excluded from generator training

These scores were strong enough to bootstrap pseudo-labeling, but I still filtered aggressively by confidence to keep style boundaries crisp.

What actually made this work

  1. Cleaning first. Dedup + short-comment removal prevented the pipeline from learning repetition and noise.
  2. Pseudo-labeling as a multiplier. 1k labels became 8k quickly once the classifier could assist.
  3. Confidence filtering. 4M raw comments are impressive. 1.8M high-confidence labels are trainable.
  4. A strict noise bucket. Excluding garbage is the difference between “style control” and “style soup”.
  5. LoRA on a small model. Practical, cheap, and good enough to be hard to spot.