From 4M comments to a style-controlled comment generator
tl;dr summary
I cleaned and deduped 4M scraped comments, bootstrapped style labels with a DeBERTaV3 classifier + pseudo-labeling, then fine-tuned SmolLM3 with LoRA to generate comments in controllable styles.
DeBERTaV3 pseudo-labeling, LoRA on SmolLM3, and a dedup pipeline that saved my sanity
I accidentally built the kind of dataset that sounds like a flex and behaves like a responsibility: 4+ million comments, each paired with a username, a short description of the content being commented on, and the comment itself.
The goal was not “train a general chatbot”. I wanted something narrower and more useful: generate believable comments in a specific style on demand.
The style set I stabilized on for this project:
- happy
- toxic
- sarcastic
- cringe
- wholesome
- noise (special bucket: excluded from training later)
We will talk about scraping in another post. This one is about the part that actually hurt: cleaning, labeling, scaling labels, and turning it into a controllable generator.
The part nobody posts on Twitter (except me): cleaning the dataset
Before I labeled anything, I had to make sure I wasn’t labeling the same comment 50 times in different spellings.
Real-world scraped comments are full of:
- exact duplicates (copy paste, reposts, bots)
- near duplicates (same template, tiny edits)
- empty rows and whitespace junk
- “lol”, “ok”, “.”, and other short masterpieces
So I built a dedup and cleanup pass that runs directly on the SQLite database storing the comments.
What the cleaning pass actually did
1) Normalize every comment
Before comparing anything, I normalized text by:
- collapsing repeated whitespace
- trimming edges
- lowercasing
This turns “Nice!!!”, “ nice!!! ” and “NICE!!!” into the same canonical string so the next steps work reliably.
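A minimal sketch of that normalization (the function name and exact order of operations here are illustrative):

```python
import re

def normalize_comment(text: str) -> str:
    """Canonical form used only for comparison: trim, collapse whitespace, lowercase."""
    text = text.strip()                # trim edges
    text = re.sub(r"\s+", " ", text)   # collapse repeated whitespace
    return text.lower()                # lowercase

# "Nice!!!", "  nice!!!  " and "NICE!!!" all normalize to "nice!!!"
```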
2) Hard-delete very short comments
Anything with length < 12 characters gets dropped immediately.
This removes a huge chunk of low-signal content: reaction noise, one-word replies, and little fragments that only make sense in-thread (and usually not even then).
3) Exact dedup (fast path)
After normalization, I computed a fast hash key for each comment (xxHash) and used an in-memory “seen” set.
If the key was already seen, it was an exact duplicate and got deleted.
- Removed by exact dedup: 17k rows
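In sketch form, the fast path is little more than hash-and-check against an in-memory set (the row-iteration plumbing is abstracted into a plain iterable here):

```python
import xxhash

def find_exact_duplicates(comments):
    """comments: iterable of (row_id, normalized_text). Returns row ids to delete."""
    seen = set()
    duplicate_ids = []
    for row_id, text in comments:
        key = xxhash.xxh64(text.encode("utf-8")).intdigest()
        if key in seen:
            duplicate_ids.append(row_id)   # already saw this exact string: mark for deletion
        else:
            seen.add(key)
    return duplicate_ids
```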
4) Near dedup (the important part)
Exact dedup is not enough because the internet loves templates.
So I used a MinHash + LSH approach with character n-grams, and treated comments as duplicates when their approximate Jaccard similarity crossed a high threshold.
This catches things like:
- same sentence with emojis removed
- repeated templates with a couple words changed
- tiny variations that are still basically the same comment
- Removed by near-dedup: 112k rows
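A sketch of the approach with the datasketch library (one common implementation; the threshold and permutation count below are illustrative, the real pass just used a high Jaccard threshold):

```python
from datasketch import MinHash, MinHashLSH

def char_ngram_minhash(text: str, n: int = 3, num_perm: int = 128) -> MinHash:
    """MinHash signature over character n-grams of one comment."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - n + 1, 1)):
        m.update(text[i:i + n].encode("utf-8"))
    return m

def find_near_duplicates(comments, threshold: float = 0.85):
    """comments: iterable of (row_id, normalized_text). Returns row ids to delete."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    near_duplicate_ids = []
    for row_id, text in comments:
        sig = char_ngram_minhash(text)
        if lsh.query(sig):                       # a similar comment is already indexed
            near_duplicate_ids.append(row_id)
        else:
            lsh.insert(str(row_id), sig)
    return near_duplicate_ids
```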
5) Chunked processing + bulk deletes
Everything ran in chunks (batch reads, buffered deletes) so it could process millions of rows without turning my machine into a smoke test.
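The deletion side, sketched (table and column names are assumptions, not the real schema):

```python
import sqlite3

def delete_in_chunks(db_path: str, row_ids: list[int], chunk_size: int = 10_000) -> None:
    """Delete rows in buffered batches instead of one giant statement."""
    conn = sqlite3.connect(db_path)
    try:
        for start in range(0, len(row_ids), chunk_size):
            chunk = row_ids[start:start + chunk_size]
            conn.executemany("DELETE FROM comments WHERE id = ?", ((i,) for i in chunk))
            conn.commit()   # commit per chunk keeps each transaction small
    finally:
        conn.close()
```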
What changed after cleaning
After cleaning, the dataset landed at:
- Final row count: 3.7M comments
- Average comment length: 44 characters
I also tracked distribution changes (length, emoji usage, etc.) as a sanity check. The point was not to sterilize the dataset, just to remove duplication and low-value sludge so the modeling steps would learn signal instead of repetition.
Labels: pick styles you can actually label
I kept the taxonomy intentionally small.
If you cannot define a label in one sentence, you do not have a label, you have a future argument with yourself.
I treated noise as a first-class concept: spam, contextless fragments, unreadable junk, and “I refuse to teach this to a model” all go there. Later, noise is excluded from generator training.
Multi-class only. One label per comment.
Step 1: hand-label ~1k comments
I labeled about 1,000 comments by hand.
This was just enough to train the first classifier. Not great, not terrible, and definitely enough to bootstrap the next step.
Step 2: fine-tune DeBERTaV3 base
I fine-tuned DeBERTaV3 base as a multi-class classifier on comment text only.
No fancy tricks. No special losses. The goal was not perfection, it was usefulness.
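For orientation, the shape of that fine-tune with Hugging Face transformers (hyperparameters and the dataset plumbing here are illustrative, not the exact run):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=len(LABELS)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# train_ds / val_ds: assumed Hugging Face datasets with "text" and integer "label" columns
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="style-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=train_ds.map(tokenize, batched=True),
    eval_dataset=val_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,
)
trainer.train()
```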
About the 85.22% accuracy number
I measured 85.22% accuracy on a validation set that was ~20% randomly sampled from the labeled data.
I mention it, but I do not treat it as a strong scientific result. Random splits can be optimistic if you have near-duplicates, repeated templates, or topical clustering that leaks across train and validation. It was mainly a “does this classifier work at all?” signal while iterating.
Step 3: model-assisted labeling (pseudo-labeling with a human in the loop)
This is where the project stops being “tiny dataset” and becomes “pipeline”.
Loop:
- sample a comment
- DeBERTa predicts a style
- I accept or correct
- retrain periodically
- repeat
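The accept/correct step, sketched as a tiny terminal loop (classifier.predict is an assumed wrapper around the DeBERTa model, and the session size is arbitrary):

```python
import random

def review_session(classifier, unlabeled, labeled, n=50):
    """One assisted-labeling pass: the model proposes a style, the human accepts or corrects."""
    for comment in random.sample(unlabeled, n):
        suggested = classifier.predict(comment)   # assumed: returns a single style string
        answer = input(f"{comment}\n  suggested: {suggested} (enter = accept, or type a label): ")
        labeled.append((comment, answer.strip() or suggested))
```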
I used a few strategies:
- single random samples with accept/correct
- bulk review batches
- once I had ~4k labels, I also generated comments in a target style, reviewed them, and reclassified bad generations into noise
That last part was surprisingly useful: it exposed failure modes early and made noise a practical quality gate, not an afterthought.
This got me to about 8k labeled comments.
I also tried using a bigger LLM for labeling (gpt-oss-safeguard-20b), but it was too expensive compute-wise and not efficient enough for this workflow. The classifier plus human correction was simpler and faster.
Step 4: auto-label at scale, then filter hard
Once the classifier stabilized, I ran it over the cleaned corpus and kept only high-confidence labels:
- keep predictions with >= 70% probability
- if noise had > 40% confidence, drop the comment entirely
- exclude noise from the final training dataset
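As a function, those gates look like this (a sketch; probs is assumed to be the classifier's softmax output as a label-to-probability dict):

```python
KEEP_THRESHOLD = 0.70        # minimum probability for the winning style
NOISE_DROP_THRESHOLD = 0.40  # anything this likely to be noise is dropped outright

def filter_prediction(probs: dict[str, float]) -> str | None:
    """Return a label to keep, or None to drop the comment from the training set."""
    if probs.get("noise", 0.0) > NOISE_DROP_THRESHOLD:
        return None
    label, p = max(probs.items(), key=lambda kv: kv[1])
    if label != "noise" and p >= KEEP_THRESHOLD:
        return label
    return None
```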
That produced:
- ~1.8M confidently labeled comments (minus noise)
This trade matters. It is better to train on fewer samples with strong labels than millions of weak guesses that blur style boundaries.
Step 5: train the generator (SmolLM3 + LoRA, SFT-only)
With ~1.8M filtered training samples, I trained a generator:
- Base model: SmolLM3 (3B)
- Training: SFT-only
- Fine-tune method: LoRA
- Hardware: RTX 4090
- Stack: Unsloth, latest CUDA, TensorRT for inference
This was not a “replace your assistant” model. It was a “produce believable comments in a requested style” model, and LoRA was the right tool for that job.
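The training loop itself, roughly, in the Unsloth + TRL shape (model id, LoRA rank, and batch settings here are illustrative rather than the exact configuration):

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="HuggingFaceTB/SmolLM3-3B",   # assumed model id
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# train_ds: assumed dataset where each row is already rendered to a single "text" string
# with the chat template (system + user + assistant, as described below)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    args=SFTConfig(output_dir="comment-gen", per_device_train_batch_size=8,
                   gradient_accumulation_steps=4, num_train_epochs=1),
)
trainer.train()
```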
Prompt format
Training data was structured like chat:
- System prompt: style + a few guidelines
- User message:
<username>...</username><description>...</description>
- Assistant output: the comment text
No transformations. No complicated post-processing. Just clean conditioning and direct generation.
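One training example, rendered as chat messages (the system wording and the assistant comment are invented for illustration; the placeholders in the user turn mirror the real format):

```python
example = [
    {"role": "system",
     "content": "Write a comment in the sarcastic style. Keep it short and natural."},  # illustrative wording
    {"role": "user",
     "content": "<username>...</username><description>...</description>"},
    {"role": "assistant",
     "content": "oh sure, because that always goes well"},  # invented example comment
]
```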
Evaluation: could I spot my own model?
I ran an arena-style human eval:
- “Real” comments were drawn from the same distribution as the prompts, so it was not an unfair mismatch.
- I was the only rater.
- Each trial: two candidate comments, I try to identify which one is model-generated.
Results:
- latest evaluation: n = 500
- I correctly identified the model about 57% of the time
- earlier quick checks were n = 100
57% is close enough to random guessing to be a meaningful win for this use case. It is not invisibility, but it is no longer “obvious machine text”.
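If you want error bars on a detection rate like this, a binomial confidence interval over the trial count is a quick sanity check (285/500 assumes the ~57% was exact):

```python
from statsmodels.stats.proportion import proportion_confint

correct, trials = 285, 500   # ~57% detection rate over the latest eval
low, high = proportion_confint(correct, trials, alpha=0.05, method="wilson")
print(f"95% CI for the detection rate: {low:.3f} to {high:.3f}")
```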
Per-label F1 scores
I evaluated the style classifier on a held-out validation split (useful for iteration, not a publication-grade benchmark). Per-label F1:
- happy: F1 = 0.84
- toxic: F1 = 0.82
- sarcastic: F1 = 0.80
- cringe: F1 = 0.81
- wholesome: F1 = 0.84
- noise: tracked separately, excluded from generator training
These scores were strong enough to bootstrap pseudo-labeling, but I still filtered aggressively by confidence to keep style boundaries crisp.
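For reference, per-label F1 on a split like this is a single sklearn call (y_true and y_pred are assumed to be lists of label strings for the validation set):

```python
from sklearn.metrics import classification_report

LABELS = ["happy", "toxic", "sarcastic", "cringe", "wholesome", "noise"]

# y_true / y_pred: assumed gold and predicted labels for the held-out split
print(classification_report(y_true, y_pred, labels=LABELS, digits=2))
```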
What actually made this work
- Cleaning first. Dedup + short-comment removal prevented the pipeline from learning repetition and noise.
- Pseudo-labeling as a multiplier. 1k labels became 8k quickly once the classifier could assist.
- Confidence filtering. 4M raw comments are impressive. 1.8M high-confidence labels are trainable.
- A strict noise bucket. Excluding garbage is the difference between “style control” and “style soup”.
- LoRA on a small model. Practical, cheap, and good enough to be hard to spot.