RLHF vs DPO vs IPO vs KTO: which alignment method should you use
You have a base model, say Llama 3.2 8B, that can write poetry in any meter and pass the bar exam. It can also generate instructions for synthesizing controlled substances, roleplay as a manipulative therapist, and explain in loving detail why your pull request is an affront to good taste. You need to align it: remove the harmful outputs while keeping the capability. Your mentor says "use RLHF." A paper on your feed says DPO is...