Reinforcement Learning from Human Feedback
In this chapter, we’ll dive into Reinforcement Learning from Human Feedback (RLHF), a powerful technique for aligning LLMs with human preferences. RLHF fine-tunes a language model using reinforcement learning driven by human feedback, steering the model’s outputs toward human preferences and improving the quality and safety of the generated text.
RLHF differs from standard supervised fine-tuning by optimizing for human preferences rather than predefined correct answers. Whereas supervised learning minimizes a loss against labeled examples, RLHF trains a reward model from human comparisons between model outputs and then uses that reward signal, typically with proximal policy optimization (PPO), to update the model’s policy. The process usually adds a divergence penalty against the initial model so the policy does not drift too far from its original distribution.
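To make the reward shaping concrete, here is a minimal sketch in plain Python of how a per-token divergence penalty against the frozen initial model can be combined with the reward model’s score before PPO updates the policy. The function name, the `beta` coefficient, and the toy values are illustrative assumptions, not part of any specific library.

```python
from typing import List


def shaped_rewards(
    rm_score: float,               # scalar score from the learned reward model
    policy_logprobs: List[float],  # log-prob of each response token under the current policy
    ref_logprobs: List[float],     # log-prob of each response token under the frozen initial model
    beta: float = 0.1,             # illustrative strength of the divergence penalty
) -> List[float]:
    """Return per-token rewards: a KL-style penalty -beta * (log pi_theta - log pi_ref)
    on every token, with the reward-model score added on the final token, as in
    common PPO-based RLHF setups."""
    rewards = [
        -beta * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score  # sequence-level preference signal arrives at the end
    return rewards


if __name__ == "__main__":
    # Toy example: three response tokens, policy slightly diverged from the reference.
    print(shaped_rewards(
        rm_score=1.5,
        policy_logprobs=[-1.2, -0.8, -2.0],
        ref_logprobs=[-1.0, -0.9, -1.8],
    ))
```

The penalty term keeps the fine-tuned policy close to the initial model while the reward-model score pulls it toward outputs humans prefer; `beta` trades off the two.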
The key benefits of RLHF are as follows:
- Improved alignment of the model’s outputs with human preferences