What is DPO and how does it differ from RLHF?
LLM Engineer Interview Questions: Fine-Tuning, LoRA, QLoRA, PEFT, and Instruction Tuning
Nortren
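DPO stands for Direct Preference Optimization. It targets the same alignment goal as RLHF but skips the two most complex stages: training a separate reward model and running reinforcement learning (typically PPO). Instead, DPO optimizes the language model directly on preference pairs (a chosen and a rejected response to the same prompt) using a simple classification-style loss measured against a frozen reference model. It is generally simpler to implement, cheaper to train, and more stable than PPO-based RLHF, and it is now widely used in its place.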
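To make the "classification-style loss" concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and the frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    Each argument is the summed log-probability of a full response given
    the prompt. beta controls how strongly the policy is kept close to the
    reference model; 0.1 is an assumed default, not a recommendation.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Binary cross-entropy with the label "chosen wins":
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference, the loss is log(2) (about 0.693), the same as a coin-flip classifier. It drops as the policy raises the chosen response's likelihood relative to the rejected one. In real training the same computation would run on batched tensors with gradients flowing only through the policy's log-probabilities.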
arxiv.org