A guide to how I did LLM supervised fine-tuning on Reddit’s r/dadjokes dataset to teach Qwen3 about puns, sarcasm, and fart jokes.
Disclaimer: All the dad jokes in this post were generated using the LLM built here.
Demo: http://shutty.ddnss.de/ — runs on my homelab, uptime not guaranteed. TLDR — good prompt: “why did chicken cross the road?”, bad prompt: “say a joke”.
Modern LLMs are quite good at safety, but notoriously bad at humor — and yes, you’re absolutely right, these two things are definitely connected. This polite, helpful, and completely sterile behavior is baked into LLMs during the post-training stage, where they learn prompt-following by being fed endless streams of boring request–response conversations that they are expected to memorize.
If you post-train an LLM on a large dataset of math problems, it will become significantly better at math. But what happens if you post-train it on bad puns and fart jokes? There’s only one way to find out, and that’s exactly my plan for today.
- Let’s make a conversational dataset based on Reddit semi-public corpus.
- Then teach Qwen3 to follow a classical prompt+punchline dadjoke structure with a regular supervised fine-tuning.
- Make it learn the difference between good and bad jokes with DPO: Direct Preference Optimization.
- Evaluate the result with an LLM-as-a-Judge using GPT-5.2.
I trained an LLM to tell a dadjoke on any prompt. My dadjoke LLM is so good it’s a joke.
The dataset
The ultimate source of dad jokes is Reddit’s /r/dadjokes subreddit. Yes, it can be scraped, but there are plenty of unofficial Reddit corpus dumps available online (e.g. academic torrents).
The one I used contains 511k raw submissions and 2.5M comments. I’m not entirely sure how legal this is, but legality aside, it comes with some serious data quality issues:
Typical /r/dadjoke post.
- Not all comments are jokes. Some of them are, ahem, just comments.
- Posts and comments are unstructured, have a lot of unrelated text, and need to be re-formatted in a strict intro+punchline layout suitable for fine-tuning.
- A lot of duplicates: the joke about chicken crossing the road gets re-posted every month, and this will apparently continue until morale improves.
Instead of fighting this with regexes, I loaded Gemma 3–27B onto my GPU and asked it to parse title + body (and title + body + comment) tuples into intro and punchline fields using the following prompt:
Gemma3–27B prompt used for dataset preprocessing.
This approach wasn’t very scientific and required quite a bit of back-and-forth to cover edge cases, but in the end it produced some pretty solid intro–punchline pairs.
Source post, intro and punchline fields.
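For illustration, here’s roughly what that extraction step can look like. This is a minimal sketch rather than the exact code I ran: it assumes Gemma 3-27B is served behind a local OpenAI-compatible endpoint, and the prompt wording and field names are simplified.

```python
import json
from openai import OpenAI

# Assumption: Gemma 3-27B is served locally behind an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

EXTRACT_PROMPT = """You are cleaning a dataset of dad jokes.
Given a Reddit post, rewrite it as a JSON object with two fields:
"intro" (the setup) and "punchline". Drop edits, links and unrelated text.
Return only the JSON object."""

def extract_joke(title: str, body: str, comment: str | None = None) -> dict | None:
    post = f"Title: {title}\nBody: {body}"
    if comment:
        post += f"\nTop comment: {comment}"
    resp = client.chat.completions.create(
        model="gemma-3-27b-it",
        messages=[
            {"role": "system", "content": EXTRACT_PROMPT},
            {"role": "user", "content": post},
        ],
        temperature=0.0,
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return None  # skip posts the model could not turn into clean JSON
```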
In addition to submissions, I also processed upvoted comments in a similar way to fish for extra jokes. Since not all comments are actually jokes, I ran a simple LLM-as-a-judge classifier to distinguish between joke and non-joke comments.
LLM-Judge prompt for joke detection.
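A minimal sketch of that classifier, reusing the client from the extraction sketch above; the real judge prompt is more elaborate than this:

```python
JUDGE_PROMPT = (
    "You will see a Reddit comment from r/dadjokes. "
    "Answer with a single word: JOKE if it is a self-contained joke, "
    "or OTHER if it is just a reaction or discussion."
)

def is_joke(comment: str) -> bool:
    resp = client.chat.completions.create(
        model="gemma-3-27b-it",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": comment},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("JOKE")
```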
It’s far from perfect, but it got us to a dataset that’s good enough for fine-tuning:
- 53k intro–punchline pairs from submissions and comments
- All pairs classified as jokes
The bartender said, “Why did the dad joke LLM go into a bar?” I said, “I don’t know.” He replied, “It was hoping for a pun-ishment!”
Supervised fine-tuning
The goal of the SFT stage is to teach the LLM to reproduce the response patterns present in the training dataset.
- The dataset consists of pairs of prompts and ground-truth responses.
- During training, the LLM is rewarded whenever it assigns higher probability to the next token from the ground-truth sequence.
At this stage, we’re not distinguishing between good and bad jokes. The objective is simply to teach the model the structure of a dad joke and expose it to a diverse set of puns.
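Concretely, each intro+punchline pair becomes a two-turn conversation in the format TRL’s SFTTrainer understands. A sketch, with hypothetical field and file names:

```python
from datasets import Dataset

def to_conversation(row: dict) -> dict:
    # Map one intro/punchline pair into the conversational "messages" format.
    return {
        "messages": [
            {"role": "user", "content": row["intro"]},
            {"role": "assistant", "content": row["punchline"]},
        ]
    }

pairs = Dataset.from_json("dadjokes_sft.jsonl")  # assumed file name
sft_dataset = pairs.map(to_conversation, remove_columns=pairs.column_names)
```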
SFT tuning.
Being a lazy person, I opted for the easiest possible fine-tuning setup. In other words, doing as little manual work as possible:
- Use the Hugging Face TRL library, which already provides SFTTrainer and DPOTrainer, so we don’t have to think too much about losses and tensor plumbing.
- Use Hugging Face Datasets for loading and preprocessing the training data.
- For scaling, rely on Accelerate for distributed training and PEFT library with 4-bit QLoRA to conserve precious VRAM.
An ML engineer walks into a bar and says that he trained a huge LLM. The bartender says “You must be a transformer” and pours him a drink.
For the base model to experiment with, I went with Qwen3, because why not:
- It’s fast to train and well-supported in the HF ecosystem.
- It comes in multiple sizes: 1.7B, 4B, 8B, 14B and 32B, which lets us explore how joke quality scales with model size.
My homelab
As a hobby, I maintain my own homelab server for all the LLM experiments like this one, and the monster looks like this:
Yes, the case is made of wood.
If you’re curious about the specs:
- Gigabyte MZ32-AR0 SP3 motherboard with an AMD EPYC Rome 7282 CPU. Bought the whole combo on eBay from a Chinese seller for $400.
- 2x MSI RTX 5090 with 32 GB VRAM each. Around $2,400 apiece.
- 256 GB RDIMM DDR4 RAM, bought used on eBay for ~$400, before the price surge of autumn 2025.
- 2x Linkup PCIe 4.0 riser cables, $70 each. Leftovers from my previous homelab with two RTX 4090 GPUs.
- Corsair HX1500i 1.5 kW PSU, $300 from Amazon.
- DIY wooden case made from OBI planks, $5. And a lot of dust.
My dad built a very expensive homelab server for machine learning experiments. I guess that’s what you call a GPU-mented family.
The main downsides of this rig are power draw (around 1.3 kW peak), noise (I have to sleep with earplugs), and PCIe 4.0 instead of 5.0.
When training larger models, you generally have two options:
- Distributed Data Parallel (DDP) training: The full model is loaded on each GPU, and gradients are synchronized at the end of each training step. This approach is fast, but you’re limited by the VRAM of a single GPU. In practice, this caps us at around 30B parameters with QLoRA.
- Fully Sharded Data Parallel (FSDP, DeepSpeed ZeRO-3): The model is split into shards and distributed across multiple GPUs. This requires a large number of all-gather operations within each training step, but allows you to fit much larger models, as long as you have enough GPUs.
DDP vs FSDP. Image from Intel Gaudi GPU docs.
Because GPU P2P isn’t available on my motherboard, FSDP/DeepSpeed-style distributed training isn’t really worth it here. Performance becomes bottlenecked by CPU–GPU transfers, and running FSDP on 2 GPUs ends up performing about the same as training on a single GPU.
As Qwen3 tops out at 32B params, it fits into a single GPU with QLoRA, so the whole multi-GPU DDP scaling was handled transparently by HF Accelerate:
accelerate launch --mixed-precision=bf16 train.py
Training setup
The first step is to load the model and prepare it for training. For bigger models, VRAM is the most limiting factor.
My wife said that she will divorce me if I buy one more GPU for ML training. I told her that I will not buy any more GPUs, but she is not buying it.
When you train an LLM, your VRAM is taken by 4 things:
- Model weights. With mixed-precision training, you should expect roughly 16 bits per parameter. A 32B model therefore requires about 64 GB of VRAM just for the weights. There are some experimental approaches for direct 8-bit or 4-bit training, but I haven’t used them so far.
- Activations. During the forward pass, the model needs to store 16- or 32-bit activation values for each layer so they can be reused during backpropagation.
- Gradients. These are the derivatives of the training loss with respect to the model weights and are required by the optimizer to update them. Gradients are typically stored in 16- or 32-bit precision.
- Optimizer state. Modern AdamW-like optimizers maintain two additional values per parameter (first- and second-order moments of the gradients), usually stored as 8-, 16-, or 32-bit floats.
The memory required for activations and gradients also depends heavily on your training setup. In particular, it scales with:
- Vocabulary size (typically 64k–256k tokens)
- Sequence length (the combined length of prompt and response)
- Batch size. Bigger batches mean better hardware utilization, but fewer training steps per epoch.
VRAM layout for full weight training.
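As a back-of-envelope check of the list above, here’s a rough lower bound that counts only weights, gradients and the two AdamW moments (all in bf16), ignoring activations entirely:

```python
def finetune_vram_gb(params_b: float,
                     weight_bytes: float = 2,            # bf16 weights
                     grad_bytes: float = 2,              # bf16 gradients
                     moment_bytes: float = 2) -> float:  # per AdamW moment, two of them
    """Very rough lower bound for full fine-tuning VRAM, activations not included."""
    return params_b * (weight_bytes + grad_bytes + 2 * moment_bytes)

print(finetune_vram_gb(8))    # ~64 GB for an 8B model, before activations
print(finetune_vram_gb(32))   # ~256 GB for a 32B model
```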
In practice, fully fine-tuning an 8B model (such as Qwen3-8B) requires around 60 GB of VRAM, even with the smallest possible batch size of 1, which isn’t exactly easy with the hardware I have. But let’s cheat a little by using QLoRA:
- We load the base model weights in int4 and never update or train them directly.
- We add a small adapter on top of the model (typically 4M–16M parameters) which is the only part we actually train. This adapter represents a set of low-rank linear transformations applied to the underlying weights. If you’ve ever done SVD at university, the idea is very similar.
- As a result, we only need to store activations, gradients, and optimizer state for this tiny adapter, dramatically reducing VRAM usage.
QLoRA training memory layout.
In practice, with batch_size = 1, a Qwen3–32B model fits in roughly 20 GB of VRAM using QLoRA, which is exactly what we need, and still leaves plenty of headroom to increase the batch size.
SFTTrainer, the supervised fine-tuning swiss army knife
Being somewhat GPU-poor, let’s load the base model weights in int4, while keeping the LoRA adapter in bfloat16 to enable mixed-precision training and save VRAM.
Model loading with HF transformers.
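If the screenshot is hard to read, the loading step boils down to something like this. A sketch, not the exact code; the FlashAttention flag is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # frozen base weights in int4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # LoRA math runs in bf16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # assumption: flash-attn is installed
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
```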
In theory, we could keep the base weights in int8, but in my experience this doesn’t lead to any noticeable improvement in model quality, while consuming more VRAM — so it’s not worth it.
With that decided, let’s move on to configuring the trainer.
A lot of parameters to tune!
There are quite a few parameters to tweak here, and all of them matter:
- Gradient checkpointing. This reduces GPU memory usage by not storing all intermediate activations and instead recomputing them during backpropagation. It trades additional compute for lower memory consumption. In my anecdotal experience, this results in a 20–30% reduction in VRAM usage at the cost of roughly 10% slower training.
- Warmup and LR scheduler. Starting with a high learning rate can shock the model with overly large weight updates. Instead, let’s gradually ramp up the learning rate over the first 5% of training steps, then smoothly decay it. This usually leads to slightly better training loss and more stable convergence.
- Assistant-only loss. This is one of the most important settings. When training on conversational data, we only care about the assistant’s response being fluent and well-formed. With assistant_only_loss=False, the model is also trained on the intro part of the joke, which is typically written by the user, not what we want.
- Custom chat_template. By default, Qwen3 does not mark which parts of the conversation are generated by the assistant, which causes assistant_only_loss to fail. We work around this by providing a custom chat template with additional markup to explicitly label assistant outputs (see the config sketch below).
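Putting those knobs together, the training config looks roughly like the sketch below. Values are placeholders, and parameter names follow recent TRL releases, so they may differ slightly from the screenshot:

```python
from pathlib import Path
from trl import SFTConfig

# Assumption: the custom template with assistant markup lives in a local Jinja file.
tokenizer.chat_template = Path("qwen3_assistant_markup.jinja").read_text()

sft_config = SFTConfig(
    output_dir="qwen3-32b-dadjokes-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,   # trade ~10% speed for 20-30% less VRAM
    learning_rate=1e-4,
    warmup_ratio=0.05,             # ramp the LR up over the first 5% of steps
    lr_scheduler_type="cosine",
    bf16=True,
    assistant_only_loss=True,      # only score the punchline tokens
    max_length=512,
    logging_steps=10,
)
```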
Why should you warm up your LLM before training? Otherwise it will be cold to you.
And we’re ready for a training run:
SFTTrainer engage!
I also make a small change to the tokenizer:
- By default, Qwen3 has no padding token, since it was originally trained on packed, non-padded sequences. So I simply reused the EOS token as padding (see the sketch after this list). In theory, sample packing might also work, but I haven’t gone down that rabbit hole yet.
- Qwen3 also seems to perform best with right padding, at least according to various GitHub discussions. I’m not entirely sure how much this matters in practice, but 🤞.
- I set fairly conservative LoRA hyperparameters r = 16 and α = 32, which are commonly recommended defaults if you’re not sure where to start. In practice, you’d want to run proper hyperparameter sweeps here, but not today.
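The tokenizer tweaks, the LoRA config and the trainer wiring from the list above then fit in a few lines. Again a sketch rather than the exact code; dropout and target_modules are assumptions:

```python
from peft import LoraConfig
from trl import SFTTrainer

# Qwen3 ships without a pad token, so reuse EOS and pad on the right.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",   # assumption: adapt all linear projections
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=sft_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()
```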
Here are some graphs from the W&B dashboard for training a series of 1.7B, 4B, 8B, 14B and 32B models:
The bigger, the better.
Unsurprisingly, larger models perform better. The 32B model achieves the lowest training and evaluation loss, as well as the highest evaluation token accuracy.
For the intro “how many Google engineers do you need to screw in a lightbulb?”, here are the options the 32B model generates:
- None. That’s a Microsoft problem.
- None. The lightbulb is already a Google product.
- None, Google engineers don’t do physical labor.
- Just one, but he’ll probably spend weeks optimizing it.
- Just one. He’ll hold the bulb and let the world revolve around him.
Jokes are quite OK, but we can do better.
Good and bad jokes
Our training dataset contains both good and bad jokes, but since we’re only interested in LOLing, let’s mix human preferences into the model we’ve just built:
- The Reddit corpus has upvotes, which we can use as a good/bad label.
- For each highly-upvoted intro+punchline pair from the SFT dataset, we also sample a single downvoted joke as a negative, non-funny example.
- This way I got just 10K triplets, but it’s more than enough for a preference alignment pass.
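Each triplet then becomes a prompt/chosen/rejected record in the conversational format DPOTrainer accepts. A sketch with hypothetical field and file names:

```python
from datasets import Dataset

def to_preference(row: dict) -> dict:
    # One upvoted punchline as "chosen", one downvoted punchline as "rejected".
    return {
        "prompt": [{"role": "user", "content": row["intro"]}],
        "chosen": [{"role": "assistant", "content": row["upvoted_punchline"]}],
        "rejected": [{"role": "assistant", "content": row["downvoted_punchline"]}],
    }

dpo_dataset = Dataset.from_json("dadjokes_dpo.jsonl").map(to_preference)
```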
Another popular downvoted response is “I don’t get it”
After teaching the model basic dad-joke structure with SFT, let’s show it the difference between good and bad jokes via Direct Preference Optimization (DPO).
DPO in a nutshell.
With traditional SFT, you only show the LLM the good route, but never specifically punish it for generating well-known bad responses.
With DPOTrainer and QLoRA, things get a little tricky:
Loading the existing QLoRA adapter for further training.
We load the same adapter from the previous training stage and keep training it.
DPO Engage!
I was not able to perform classical DPO training (e.g. with reference_free=False) due to weird crashes inside PyTorch. But whatever:
- Classical DPO computes how your positive and negative examples deviate from a reference model (e.g. the original model after SFT). This helps with stability when you have noisy examples.
- SimPO (or reference-free DPO) drops the reference part from the equation. You just need to prefer positives over negatives. Much simpler, and no weird crashes (sketched below).
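The reference-free run then looks roughly like this; hyperparameters are placeholders, not the exact values I used:

```python
from peft import PeftModel
from trl import DPOConfig, DPOTrainer

# Attach the SFT adapter to a freshly loaded 4-bit base model and keep training it.
model = PeftModel.from_pretrained(model, "qwen3-32b-dadjokes-sft", is_trainable=True)

dpo_config = DPOConfig(
    output_dir="qwen3-32b-dadjokes-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,
    reference_free=True,   # SimPO-style: no frozen reference model in the loss
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)
trainer.train()
```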
It seems that the accuracy of choosing good vs. bad jokes goes up significantly during DPO post-training, which is a great sign.
For the same intro “how many Google engineers do you need to screw in a lightbulb?” here are the options:
- Just one. But it will take him about 10 years to design a system that will allow him to do it efficiently.
- None, they just make the darkness a feature.
- None, the bulb just needs to be Google compatible.
- Just one, but it’ll take two weeks to write the specs, four weeks to design it, eight weeks to code it, and then it’ll be deprecated.
Much more creative than just a regular SFT!
Running this thing
We spin up vLLM with OpenWebUI.
Again, a lot of non-default parameters here.
How many OpenAI engineers do you need to make a dad joke LLM? Just one. They’re all pun-derful.
Beyond the expected flags like --enable-lora, there’s quite a bit of tinkering going on here:
- Compilation config. There’s a regression in vLLM 0.11.x+ that causes OOM errors when using QLoRA with torch.compile. This configuration is essentially a workaround to make it run.
- Tensor parallel. We load the model across both GPUs, partly for practical reasons, and partly so I can brag that I own two GPUs.
- gpu_memory_utilization, max_num_seqs and max_model_len are used to cap VRAM usage and keep the model from blowing up memory.
- Quantization. We use bitsandbytes int4 quantization to save VRAM and to match the precision used by the LoRA adapter (the whole launch is sketched below).
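For reference, here are roughly the same knobs expressed through vLLM’s offline Python API instead of the serving command. Values are illustrative, and the adapter path is assumed to be the output directory of the DPO step:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-32B",
    quantization="bitsandbytes",   # int4 to match the LoRA training precision
    enable_lora=True,
    max_lora_rank=16,
    tensor_parallel_size=2,        # split across both GPUs
    gpu_memory_utilization=0.90,
    max_model_len=1024,
    max_num_seqs=8,
)

out = llm.generate(
    ["Why did the chicken cross the road?"],
    SamplingParams(temperature=0.7, max_tokens=128),
    lora_request=LoRARequest("dadjokes", 1, "qwen3-32b-dadjokes-dpo"),
)
print(out[0].outputs[0].text)
```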
And with that, we arrive at the ultimate LLM-generated dad joke:
Evaluation
Evaluating the LPS rate (LOLs per second) is a tough task, so I just generated 3K dad jokes from the SFT/DPO test split and used GPT-5.2 as a judge to classify them as funny or not. To make it deterministic, I set temperature=0 for both the generation and evaluation stages.
Scores were scaled to 0..1 range after generation.
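The judge call itself is simple. A minimal sketch: the real prompt is longer, and the raw verdicts were rescaled to the 0..1 range afterwards:

```python
from openai import OpenAI

judge = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are rating dad jokes. Given an intro and a punchline, "
    "answer with a single word: FUNNY or NOT_FUNNY."
)

def judge_joke(intro: str, punchline: str, judge_model: str = "gpt-5.2") -> int:
    resp = judge.chat.completions.create(
        model=judge_model,   # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Intro: {intro}\nPunchline: {punchline}"},
        ],
    )
    return 0 if "NOT" in resp.choices[0].message.content.upper() else 1
```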
At $0.50 per evaluation run to help OpenAI with funding, I SFT-trained a set of Qwen3 models of different sizes: 1.7B, 4B-Instruct-2507, 8B, 14B and 32B. For the biggest 32B model, I also did a DPO pass.
LLM-Judge eval scores for dadjoke models.
Yes, we could have a prompt-tuned baseline here (e.g. asking GPT-5 to “please continue a dad joke”), but I’m too lazy: this is not an academic publication. Yet.
The main conclusions from the numbers I got are quite obvious:
- Size does matter. The bigger the size, the harder your smile.
- Preference alignment is worth it when you have not just positive but also negative examples. No one likes unfunny jokes.
Demo
I host a vLLM + OpenWebUI instance with the model on my homelab: http://shutty.ddnss.de/. If it’s down, please forgive me.
OpenWebUI interface.
What does a chicken say when asked about its future plans? Egg-citing stuff!
Keep in mind that this model is not your favorite Instagram stand-up comedian. The jokes aren’t always 100% funny, but you can hit the “refresh” button a few times until something makes you smile.
The model is available on Hugging Face as an adapter: https://huggingface.co/shuttie/Qwen3-32B-dadjokes-v3
If you found this interesting, feel free to subscribe to my Medium blog :)