We cut the hallucinations of an LLM in half using a new technique: RLFR (Reinforcement Learning from Feature Rewards).

RLFR reduces hallucinations in Gemma 12B by 58% (when run with our probing harness), at ~90x lower cost per intervention than the LLM-as-judge alternative. The technique uses lightweight probes on a model's internal representations as reward signals for reinforcement learning.

This is all while avoiding:
- off-target effects (no degraded performance on standard benchmarks)
- degrading/reward-hacking the probes (the probes still work as monitors at test time!)

This is just our first public demonstration of our work building towards "intentional design": the ability to intelligently guide gradient descent, enabling a new paradigm of far more precise, robust, and effective training.

Kudos to Aaditya Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Tom McGrath, and Ekdeep Singh for this groundbreaking new work!
This is incredible work! Cutting hallucinations by 58% while keeping interventions lightweight and cost-effective is huge. Love seeing innovation that actually makes LLM behavior more predictable and controllable. This is the kind of “intentional design” the field needs. Excited to see how RLFR shapes the next generation of AI tools.
woah. straight out of the andrej dwarkesh blog from last fall.
Fantastic! Is it fair to extrapolate that now we have a new way of modifying model behavior/personality?
I hope you continue to share additional learnings. I love what the company is doing!
Nice one Eric! Love to see the progress! Will add to the list. Wonder what this means for non-knowledge tasks
Love me some technical research blogs. Thank you!
Amazing, congrats Eric and team!
Interesting!
Read the technical blog post for more: https://www.goodfire.ai/research/rlfr
Read Tom's post on our vision for intentional design: https://www.goodfire.ai/blog/intentional-design