We cut the hallucinations of an LLM in half using a new technique: RLFR (Reinforcement Learning from Feature Rewards).

RLFR reduces hallucinations in Gemma 12B by 58% (when run with our probing harness), at ~90x lower cost per intervention than the LLM-as-judge alternative. The technique uses lightweight probes on a model's internal representations as reward signals for reinforcement learning.

This is all while avoiding:
- off-target effects (no degraded performance on standard benchmarks)
- degrading/reward-hacking the probes (the probes still work as monitors at test time!)

This is just our first public demonstration of our work building towards "intentional design": the ability to intelligently guide gradient descent, enabling a new paradigm of far more precise, robust, and effective training.

Kudos to Aaditya Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Tom McGrath, and Ekdeep Singh for this groundbreaking new work!
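For readers who want a feel for what "probes as reward signals" could look like, here is a minimal sketch. It is not the RLFR implementation: the post does not describe the probe architecture or reward shaping, so the logistic linear probe, the `1 - p` reward, and the names `probe_score`, `feature_reward`, `weights`, and `bias` are all illustrative assumptions.

```python
import math
from typing import Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear probe: maps one activation vector to a
    probability that the model is hallucinating (assumed setup)."""
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    # Dot product plus bias gives the probe logit.
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Sigmoid squashes the logit into (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


def feature_reward(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Assumed reward shaping: higher reward when the probe sees
    less evidence of hallucination."""
    return 1.0 - probe_score(hidden, weights, bias)
```

In an RL loop, a scalar like `feature_reward(...)` would stand in for the score an LLM judge would otherwise produce, which is where the cost saving mentioned above would plausibly come from: a dot product instead of a second model call.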

This is incredible work! Cutting hallucinations by 58% while keeping interventions lightweight and cost-effective is huge. Love seeing innovation that actually makes LLM behavior more predictable and controllable. This is the kind of “intentional design” the field needs. Excited to see how RLFR shapes the next generation of AI tools.

woah. straight out of the andrej dwarkesh blog from last fall.

Fantastic! Is it fair to extrapolate that now we have a new way of modifying model behavior/personality?

I hope you continue to share additional learnings. I love what the company is doing!

Nice one Eric! Love to see the progress! Will add to the list. Wonder what this means for non-knowledge tasks

Love me some technical research blogs. Thank you!

Amazing, congrats Eric and team!
