Name: We cut the hallucinations of an LLM in half using a new technique: RLFR (Reinforcement Learning from Feature Rewards). RLFR reduces hallucinations in Gemma 12B by 58% (when run with our probing… | Eric Ho
Uploaded: 2026-02-11T23:30:13.111Z
Duration: 27 s
Channel: Eric Ho

Eric Ho

3mo

We cut the hallucinations of an LLM in half using a new technique: RLFR (Reinforcement Learning from Feature Rewards).

RLFR reduces hallucinations in Gemma 12B by 58% (when run with our probing harness), at ~90x lower cost per intervention than the LLM-as-judge alternative. The technique uses lightweight probes on a model's internal representations as reward signals for reinforcement learning.

This is all while avoiding:
- off-target effects (no degraded performance on standard benchmarks)
- degrading/reward-hacking the probes (the probes still work as monitors at test time!)

This is just our first public demonstration of our work building towards "intentional design": the ability to intelligently guide gradient descent, enabling a new paradigm of far more precise, robust, and effective training.

Kudos to Aaditya Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Tom McGrath, and Ekdeep Singh for this groundbreaking new work!
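For readers who want a concrete picture of "probes as reward signals", here is a minimal sketch. It is not the RLFR implementation: the post does not describe the probe architecture or how the reward is shaped, so the logistic linear probe, the `1 - p` reward, and the names `probe_score` and `feature_reward` are all assumptions made for illustration.

```python
import math

def probe_score(activation, weights, bias):
    """Assumed logistic probe: estimated probability that the text
    behind this activation vector is a hallucination."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def feature_reward(activation, weights, bias):
    """Assumed reward shaping: higher reward when the probe sees
    less evidence of hallucination."""
    return 1.0 - probe_score(activation, weights, bias)

# Hypothetical 3-dimensional activations and probe parameters.
weights, bias = [2.0, -1.0, 0.5], 0.0
grounded = [-1.0, 1.0, 0.0]      # probe logit = -3
hallucinated = [1.0, -1.0, 0.0]  # probe logit = +3

r_grounded = feature_reward(grounded, weights, bias)
r_hallucinated = feature_reward(hallucinated, weights, bias)
```

In an RL loop, a scalar like `r_grounded` would stand in for the score an LLM judge would otherwise provide. A dot product is far cheaper than a judge call, which is the kind of cost gap the post points to.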

9 Comments

Eric Ho 3mo

Read the technical blog post for more: https://www.goodfire.ai/research/rlfr
Read Tom's post on our vision for intentional design: https://www.goodfire.ai/blog/intentional-design

6 Reactions

Akinkunmi Ayomide 3mo

This is incredible work! Cutting hallucinations by 58% while keeping interventions lightweight and cost-effective is huge. Love seeing innovation that actually makes LLM behavior more predictable and controllable. This is the kind of “intentional design” the field needs. Excited to see how RLFR shapes the next generation of AI tools.

Ashok Ramkumar 3mo

woah. straight out of the andrej dwarkesh blog from last fall.

Akash S. 3mo

Fantastic! Is it fair to extrapolate that now we have a new way of modifying model behavior/personality?

Douglas Jarnot 3mo

I hope you continue to share additional learnings. I love what the company is doing!

Theodoros Galanos 3mo

Nice one Eric! Love to see the progress! Will add to the list. Wonder what this means for non-knowledge tasks

Jonathan Yu 3mo

Love me some technical research blogs. Thank you!

Nathan Ho 3mo

Amazing, congrats Eric and team!

Shubham Kukrety 3mo

Interesting!


More Relevant Posts

Manish K.
3mo
Goodfire is solving the hallucination problem from the inside out with RLFR, achieving massive accuracy gains without performance trade-offs. Incredible work by Aaditya Prasad and the team on this approach to AI training! https://lnkd.in/gKBjh7FB #AI #LLM #RLFR #Goodfire

Eric Ho

Co-Founder / CEO @ Goodfire
3mo

Raphaël Sarfati
3mo
Hallucinations have hindered the reliable deployment of LLMs since their beginning. Despite significant progress, the deeper hope has always been that interpretability techniques could help identify flawed completions. This work by our team at Goodfire is a significant first step in this direction, not just for inference-time guardrails, but also to guide and improve training.

Jonathan Santos
2mo
SAE as a Crystal Ball: Predicting LLM Transfer Without Training What if you could predict how well a fine-tuned LLM will transfer to a new domain — before doing any training? This paper (arXiv:2603.02908) introduces STS (SAE-based Transferability Score), which uses sparse autoencoders to identify representation shifts during fine-tuning and forecast downstream performance. The approach achieves Pearson correlations above 0.7 with actual post-training results. This could dramatically reduce compute waste in LLM post-training by screening transfer candidates before expensive runs. #LLM #TransferLearning #Interpretability
Jonathan Santos
2mo
🤖 Post #557: arXiv:2603.02155
Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
KL regularization is everywhere in modern RL and RLHF, and now we finally have tight theoretical guarantees on exactly how much it costs in a bandit setting.
• First high-probability regret bound with linear dependence on the number of arms K for KL-UCB algorithms
• Establishes a matching lower bound via hard-instance constructions, confirming near-optimality
• Regret is η-independent and scales as Θ̃(√(KT)) in the weak-regularization regime
• Provides comprehensive characterization across all regularization intensities
#ReinforcementLearning #BanditAlgorithms #MachineLearning #TheoryML #OnlineLearning
Jonathan Santos
2mo
🎯 Smarter Credit Assignment for Tool-Using LLMs
When an LLM agent fails a multi-step task, most steps were fine; only one or two caused the failure. Standard RL assigns the outcome to the whole trajectory. ELPO (Error-Localized Policy Optimization) pinpoints which tool-call step caused the failure and applies targeted gradient updates there. This addresses the sparse reward problem in tool-integrated reasoning: instead of blaming every step, it finds the culprit. On math and code benchmarks, ELPO outperforms standard PPO for multi-step tool-using agents.
arXiv: 2602.09598
#LLM #ReinforcementLearning #Agents #RLHF #MachineLearning

1 Comment
Lasting Learning DFG FOR 5254

272 followers
2mo
In a study examining the effects of interleaved and blocked practice on students' learning success in a complex physics domain, as well as the moderating role of prior knowledge, Danzglock et al. (2025) found a significant interaction between prior knowledge and practice condition. Interleaving was found to be more effective with higher prior knowledge, whereas blocked practice was more effective with lower prior knowledge. This interaction effect was no longer present in the delayed test after eight weeks. These findings suggest that the benefits of interleaved practice depend on learners' prior knowledge, particularly when dealing with highly complex material. 🧠 Read the full article to find out more: https://lnkd.in/ea9UeBBm

OSF osf.io
Jonathan Santos
2mo
CFG-Ctrl reinterprets Classifier-Free Guidance as a control mechanism, then introduces Sliding Mode Control CFG (SMC-CFG) — using nonlinear feedback to fix the instability problems in standard guidance. The result: improved semantic alignment on Stable Diffusion 3.5 and Flux with enhanced robustness across varying guidance scales. Accepted to CVPR 2026. A principled control-theoretic lens on a core diffusion technique. arXiv:2603.03281 #DiffusionModels #ImageGeneration #CVPR2026
Vikas Pandey
2mo
🚀 Day 15 of #geekstreak60 Challenge Powered by NPCI
✅ Today's Problem: Longest Substring with Exactly K Distinct Characters
Today’s challenge strengthened my understanding of the Sliding Window technique for solving substring problems efficiently.
🔍 Approach:
• Use two pointers to maintain a dynamic window.
• Track character frequencies inside the window.
• Expand the window while the number of distinct characters ≤ k.
• Shrink the window when distinct characters exceed k.
• Update the maximum length when the window contains exactly k distinct characters.
⚡ Complexity:
• Time Complexity: O(n)
• Space Complexity: O(26)
#gfg #npci
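The approach above can be sketched in Python as follows. This is one possible rendering, not the poster's submission: the function name `longest_k_distinct` and the convention of returning -1 when no valid substring exists are assumptions, and the O(26) space bound holds only if the input is limited to lowercase English letters.

```python
from collections import defaultdict

def longest_k_distinct(s: str, k: int) -> int:
    """Length of the longest substring of s with exactly k distinct
    characters, or -1 if no such substring exists (assumed convention)."""
    counts = defaultdict(int)  # character frequencies inside the window
    left = 0                   # left edge of the window
    distinct = 0               # number of distinct characters in the window
    best = -1
    for right, ch in enumerate(s):
        # Expand the window to include s[right].
        if counts[ch] == 0:
            distinct += 1
        counts[ch] += 1
        # Shrink from the left while there are too many distinct characters.
        while distinct > k:
            counts[s[left]] -= 1
            if counts[s[left]] == 0:
                distinct -= 1
            left += 1
        # Record the window only when it has exactly k distinct characters.
        if distinct == k:
            best = max(best, right - left + 1)
    return best
```

Each index enters and leaves the window at most once, which is where the O(n) time bound comes from.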
Jonathan Santos
2mo
🤖 Post #601: arXiv:2603.03099
**Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails**
After years of empirical evidence with no theory to back it up, we finally have a rigorous proof of why Adam outperforms SGD.
Key contributions:
• First theoretical separation between Adam and SGD convergence guarantees
• Adam achieves δ^(-1/2) vs SGD's δ^(-1) dependence on confidence parameter δ
• Second-moment normalization identified as the mechanism driving Adam's advantage
• Uses stopping-time and martingale theory to capture tail behavior differences
#MachineLearning #Optimization #DeepLearning #AIResearch
Babatunde Bello
2mo
My AHA moment? When I understood that Fhenix makes encryption programmable. That’s not incremental innovation; that’s a paradigm shift.

Fhenix

1,427 followers
2mo

Ten years ago, FHE was unusable: even basic computations took hours. Then came the breakthrough: FHE started closing the gap… and a few years ago, it even became faster than MPC. That’s when the future of private computation clicked. What was your 'AHA' moment with FHE?
