LLM Engineer Interview Questions: Fine-Tuning, LoRA, QLoRA, PEFT, and Instruction Tuning
What is RLHF?
RLHF stands for Reinforcement Learning from Human Feedback. It is a multi-stage process: starting from a supervised fine-tuned model, first train a reward model on human preference rankings of model outputs, then use reinforcement learning, typically PPO, to fine-tune the language model to produce outputs that maximize that reward. RLHF is a key technique behind the instruction-following and refusal behavior of models like ChatGPT and Claude.
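The reward-model stage trains on pairwise human preferences. A minimal sketch of the standard Bradley-Terry pairwise loss, assuming scalar rewards for a chosen and a rejected response (the function name is illustrative):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    The loss is small when the chosen output's reward clearly exceeds
    the rejected output's, and large when the ranking is violated."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin between chosen and rejected rewards yields a smaller loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

In practice the rewards come from the same network scoring both responses in a batch, and this loss is averaged over many preference pairs.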