Limitations of RLHF in language modeling
While RLHF is powerful, it faces several challenges:
- Reward hacking: Models may exploit loopholes in the reward function, optimizing the proxy score rather than genuine quality (see the toy sketch after this list)
- Limited feedback: Human feedback may not cover all possible scenarios
- Local optima: The policy optimization may converge to a suboptimal policy rather than the behavior humans actually prefer
- Scaling issues: Obtaining high-quality human feedback at scale is challenging
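As a concrete illustration of reward hacking, consider a proxy reward that correlates quality with length and confident-sounding phrases. The toy sketch below is illustrative only; the `length_biased_reward` function and the example strings are hypothetical, not a real reward model:

```python
def length_biased_reward(response: str) -> float:
    # Hypothetical proxy reward: favors longer answers and "confident" keywords.
    keywords = {"certainly", "definitely", "great question"}
    bonus = sum(1.0 for w in keywords if w in response.lower())
    return 0.01 * len(response) + bonus

honest = "Paris is the capital of France."
hacked = "Certainly! Great question. Definitely, " + "as I said, " * 50 + "Paris."

# The padded, sycophantic answer scores higher even though it is worse for the user.
print(length_biased_reward(honest), length_biased_reward(hacked))
```

A policy trained against such a reward learns to pad and flatter rather than to answer well, which is exactly the loophole-exploiting behavior described above.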
To address reward hacking, one option is a constrained optimization step: score each sample with a separate constraint model (for example, a safety or faithfulness classifier) and penalize the reward whenever the constraint score falls below a threshold. The sketch below assumes Hugging Face-style `base_model` and `tokenizer` objects and callable `reward_model`/`constraint_model` scorers, and it replaces the full clipped PPO loss with a simpler reward-weighted log-likelihood update; the added `tokenizer` and `penalty_coef` arguments are illustrative:
```python
import torch

def constrained_ppo_step(base_model, reward_model, constraint_model,
                         tokenizer, optimizer, prompt,
                         constraint_threshold=0.5, penalty_coef=1.0):
    # Sample a continuation from the current policy (prompt is a tensor of token ids).
    outputs = base_model.generate(
        prompt,
        max_length=100,
        return_dict_in_generate=True,
        output_scores=True,
    )
    generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

    # Score the sample with the reward model and the separate constraint model.
    reward = reward_model(generated_text)                # scalar preference score
    constraint_score = constraint_model(generated_text)  # e.g. a safety score in [0, 1]

    # Subtract a penalty when the constraint is violated, so the policy cannot
    # profit from reward hacking that breaks the constraint.
    shaped_reward = reward
    if constraint_score < constraint_threshold:
        shaped_reward -= penalty_coef * (constraint_threshold - constraint_score)

    # Simplified policy-gradient surrogate (stands in for the full clipped PPO loss):
    # maximize the log-probability of the sampled sequence (prompt included, for
    # brevity), weighted by the shaped reward.
    logits = base_model(outputs.sequences).logits[:, :-1, :]
    token_log_probs = torch.log_softmax(logits, dim=-1).gather(
        -1, outputs.sequences[:, 1:].unsqueeze(-1)
    ).squeeze(-1)
    loss = -(shaped_reward * token_log_probs.sum())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return generated_text, reward, constraint_score
```
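In a training loop, this step might be called once per prompt while logging the constraint score to monitor how often the policy drifts toward violations. The loop below is a usage sketch; `prompt_loader` is a hypothetical iterator over tokenized prompts:

```python
for prompt_ids in prompt_loader:  # hypothetical iterator over tokenized prompts
    text, reward, constraint = constrained_ppo_step(
        base_model, reward_model, constraint_model,
        tokenizer, optimizer, prompt_ids,
    )
    if constraint < 0.5:
        print(f"constraint violated (score={constraint:.2f}): {text[:80]}")
```

The key design choice is that the constraint is enforced by shaping the reward rather than by filtering samples, so the policy still receives a gradient signal from violating samples, just a negative one.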