LLM Engineer Interview Questions: Fine-Tuning, LoRA, QLoRA, PEFT, and Instruction Tuning
What is RLHF?
RLHF stands for Reinforcement Learning from Human Feedback. It is a multi-stage process: starting from a supervised fine-tuned model, first train a reward model on human preference rankings of model outputs, then use reinforcement learning, typically PPO, to fine-tune the language model to produce outputs that maximize that reward. RLHF is a key technique behind the instruction-following and refusal behavior of models like ChatGPT and Claude.
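The reward-model stage trains on pairwise human preferences. A minimal sketch of the standard Bradley-Terry pairwise loss, assuming scalar rewards for a chosen and a rejected response (the function name is illustrative):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    The loss is small when the chosen output's reward clearly exceeds
    the rejected output's, and large when the ranking is violated."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin between chosen and rejected rewards yields a smaller loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

In practice the rewards come from the same network scoring both responses in a batch, and this loss is averaged over many preference pairs.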