Question

What is DPO and how does it differ from RLHF?

Accepted Answer

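DPO stands for Direct Preference Optimization. It targets the same alignment goal as RLHF but skips two stages of the RLHF pipeline: training a separate reward model and running a reinforcement learning loop (typically PPO) against it. Instead, DPO optimizes the language model directly on preference pairs, each consisting of a preferred and a rejected response to the same prompt, using a simple classification-style loss. A frozen copy of the starting model serves as a reference so the trained model does not drift too far from it. In practice DPO is usually cheaper to run and more stable to train than full RLHF, and it is now widely used in its place.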
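To make the "classification-style loss" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you have already computed the summed log-probability of each response under the model being trained (the policy) and under the frozen reference model. The function names, argument names, and the default beta of 0.1 are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how strongly the policy is tied to the reference
) -> float:
    """DPO loss for one preference pair (illustrative sketch).

    Each *_logp argument is the total log-probability of a full response
    given the prompt, under either the policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary-classification-style objective: push the chosen log-ratio
    # above the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

There is no reward model and no sampling loop here: the loss depends only on log-probabilities from two forward passes per response, which is why DPO can be trained with an ordinary supervised-learning setup. In a real training run the same computation would be done on batched tensors with gradients flowing only through the policy's log-probabilities.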