Reinforcement Learning with Intrinsic Motivation
Last Updated: 24 Jun, 2025
In traditional reinforcement learning, an agent learns by receiving rewards from the environment: for example, a robot gets a point for reaching a goal, or a game-playing AI gets a score for winning. These are called extrinsic rewards because they come from outside the agent and are directly tied to the task.
However, many real-world environments are sparse in rewards. Imagine a robot exploring a new building: it might only get a reward after finding a particular room, but there’s no feedback for all the steps it takes before that. In such cases, the agent might struggle to learn, because it doesn’t know which actions are useful until it stumbles upon a reward by chance.
Figure: Agent-environment interaction in RL. A: The usual view. B: An elaboration.
Intrinsic motivation in reinforcement learning (RL) enables agents to generate internal rewards for exploration and skill development when external rewards are sparse or absent. This approach mimics human curiosity, allowing agents to learn efficiently in complex environments without explicit incentives.
Core Idea
Extrinsic vs. intrinsic rewards:
- Extrinsic rewards (e.g., game points) are provided by the environment for task-specific goals.
- Intrinsic rewards are self-generated signals that encourage behaviors like novelty-seeking or curiosity-driven exploration.
Why it matters: In sparse-reward environments (e.g., robotic navigation), agents rarely receive meaningful feedback. Intrinsic motivation creates dense learning signals by rewarding exploration, uncertainty reduction, or skill mastery.
Biological Analogy: Humans explore driven by inherent curiosity (e.g., a child tinkering with toys). Similarly, RL agents use intrinsic rewards to explore "for its own sake," accelerating learning even without immediate external benefits.
Mathematical Frameworks
Intrinsic rewards are typically computed using one of the following principles:
1. Prediction Error (Curiosity)
Agents reward themselves when their predictions fail. For a state s_t and action a_t:
R^{\text{int}}_t = \eta \cdot \left\| \phi(s_{t+1}) - \hat{\phi}(s_{t+1}) \right\|^2
where:
- \phi: State embedding (e.g., neural network features).
- \hat{\phi}: Predicted next-state embedding.
- \eta: Scaling factor.
- Intuition: Higher rewards for unpredictable outcomes (e.g., a robot entering an unseen room).
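As an illustration, here is a minimal sketch of a curiosity bonus of this form, assuming a small PyTorch forward model over state embeddings; the network sizes, learning rate and `ETA` value are illustrative choices rather than values from any particular paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 8-dim state embedding, 2-dim action.
EMBED_DIM, ACTION_DIM, ETA = 8, 2, 0.1

# Forward model: predicts the next state embedding from (phi(s_t), a_t).
forward_model = nn.Sequential(
    nn.Linear(EMBED_DIM + ACTION_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, EMBED_DIM),
)
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def curiosity_reward(phi_s, action, phi_s_next):
    """Intrinsic reward = ETA * squared prediction error of the forward model."""
    pred_next = forward_model(torch.cat([phi_s, action], dim=-1))
    error = ((pred_next - phi_s_next) ** 2).sum(dim=-1)

    # Train the forward model on the same transition so familiar
    # transitions gradually stop being rewarding.
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()

    return ETA * error.detach()

# Example transition (random stand-ins for real embeddings).
phi_s = torch.randn(1, EMBED_DIM)
action = torch.randn(1, ACTION_DIM)
phi_s_next = torch.randn(1, EMBED_DIM)
print(curiosity_reward(phi_s, action, phi_s_next))
```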
2. Information Gain (Surprise)
Rewards maximize information gain or minimize uncertainty:
R^{\text{int}}_t = \log\left( p(s_{t+1} \mid s_t, a_t) \right) - \log\left( q(s_{t+1} \mid s_t, a_t) \right)
where:
- p: Agent’s current environment model.
- q: Prior model.
- Intuition: Agents seek surprising states (e.g., rarely visited maze paths).
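A minimal sketch of this surprise signal, assuming (purely for illustration) that the agent models next-state transitions with 1-D Gaussians; `current_model` and `prior_model` are hypothetical (mean, std) pairs after and before a model update.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def surprise_reward(s_next, current_model, prior_model):
    """Intrinsic reward = log p(s'|s,a) under the updated model
    minus log q(s'|s,a) under the prior model."""
    log_p = gaussian_logpdf(s_next, *current_model)   # (mean, std) after the update
    log_q = gaussian_logpdf(s_next, *prior_model)     # (mean, std) before the update
    return log_p - log_q

# A transition that the updated model explains better than the prior
# yields a positive reward; a fully familiar transition yields ~0.
print(surprise_reward(1.2, current_model=(1.0, 0.5), prior_model=(0.0, 1.0)))
```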
3. Skill Acquisition
Agents reward competence-building:
R^{\text{int}}_{t} = I(a_{t} ; s_{t+k} \mid s_{t})
where I denotes mutual information. This rewards actions that influence future states (e.g., a chess AI practicing openings).
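A small sketch of how such a mutual-information signal could be estimated from experience, assuming a discrete setting in which the agent keeps joint visit counts of (action, future state) pairs from a fixed starting state; the count table below is made up for illustration.

```python
import numpy as np

def mutual_information(joint_counts):
    """Empirical mutual information I(A; S') from a joint count table
    whose rows index actions and columns index future states."""
    joint = joint_counts / joint_counts.sum()
    p_a = joint.sum(axis=1, keepdims=True)      # marginal over actions
    p_s = joint.sum(axis=0, keepdims=True)      # marginal over future states
    nz = joint > 0                              # ignore zero-probability cells
    return float((joint[nz] * np.log(joint[nz] / (p_a @ p_s)[nz])).sum())

# Hypothetical counts of (a_t, s_{t+k}) pairs gathered from s_t:
# each action fully determines the future state, so the skill signal is high.
counts = np.array([[10, 0],
                   [0, 10]], dtype=float)
print(mutual_information(counts))   # = log(2) ≈ 0.69 nats
```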
Key Algorithms
| Method | Mechanism | Use Case |
|---|---|---|
| ICM (Intrinsic Curiosity Module) | Predicts the next state; reward = prediction error | Exploration in unfamiliar states |
| RND (Random Network Distillation) | Predicts the output of a fixed random neural network; reward = prediction error | High-dimensional environments |
| Count-Based | Reward inversely proportional to state visitation frequency | Discrete state spaces |
How It Works: Examples
- Maze-solving: An agent receives extrinsic rewards only at the exit. Intrinsic rewards incentivize exploring dead-ends, accelerating path discovery.
- Robotic grasping: A robot practices manipulating objects without task-specific rewards, building reusable skills.
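For the maze example above, a count-based bonus (the last row of the table) is a natural fit. A minimal sketch, assuming states are hashable grid cells and using the common 1/sqrt(N) form with an illustrative scale factor:

```python
from collections import defaultdict
import math

visit_counts = defaultdict(int)

def count_bonus(state, scale=0.1):
    """Count-based intrinsic reward: rarely visited states earn more.
    `state` must be hashable (e.g., a (row, col) tuple in a grid maze)."""
    visit_counts[state] += 1
    return scale / math.sqrt(visit_counts[state])

# The first visit to a cell pays 0.1, the hundredth visit only 0.01,
# so the agent is nudged toward unexplored corridors and dead-ends.
print(count_bonus((3, 7)), count_bonus((3, 7)))
```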
Step-by-Step: IMRL in the Playroom Domain
Figure A: The Playroom domain.
1. The Playroom Setup
The environment is a playroom containing various interactive objects:
- Green Ball (top left)
- Clown (second row, second column)
- Cake (third row, second column)
- Hand (top row, fourth column): likely represents the agent’s hand effector
- Eye (second row, fourth column): represents the agent’s eye effector
- Wine Glasses (second row, fifth column): possibly a reward or event trigger
- Music Player (third row, fifth column): controls music on/off
- Blue Play Button (bottom left): possibly a light switch or another event trigger
- Red Stop Button (bottom right): possibly another event trigger
The agent has three effectors: an eye, a hand and a visual marker.
The agent’s sensors report which object is under each effector, and if both the eye and hand are on the same object, context-specific actions become available (e.g., flicking the switch, kicking the ball).
2. No Prior Knowledge
- The agent starts with no knowledge of what objects do or how to interact with them.
- Unlike classic RL, there are no predefined action-value pairs or reward signals for most actions; the agent must discover everything through exploration.
3. Encountering Salient Events
- Example 1: The agent moves both its eye and hand to the music player and discovers that pressing it turns the music on (a salient event).
- Example 2: It moves to the blue play button and discovers that it controls the room’s lighting.
- Example 3: It interacts with the green ball and finds it can be kicked, causing the clown (bell) to ring and move.
Each time the agent causes an unexpected change (music starts, light turns on, bell rings), it experiences a burst of intrinsic reward.
4. Creating and Updating Skills (Options)
The first time a salient event occurs (e.g., light turns on), the agent creates:
- An option (a skill or sub-policy) for achieving that event (e.g., “turn-light-on option”).
- An option model to predict when the event will occur.
The agent adds this option to its Skill Knowledge Base (Skill-KB).
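A minimal sketch of what an option and a Skill-KB might look like as data structures; the field names and the dictionary-based policy and model here are illustrative assumptions, not the representation used in the original IMRL work.

```python
from dataclasses import dataclass, field

@dataclass
class Option:
    """A reusable skill: a sub-policy for reaching one salient event,
    plus a learned model of when that event will occur."""
    name: str                                    # e.g., "turn-light-on"
    policy: dict = field(default_factory=dict)   # state -> action
    model: dict = field(default_factory=dict)    # state -> P(event | follow option)

skill_kb = {}   # the agent's Skill Knowledge Base

def on_salient_event(event_name):
    """The first time an event is observed, create an option for it."""
    if event_name not in skill_kb:
        skill_kb[event_name] = Option(name=event_name)
    return skill_kb[event_name]

on_salient_event("light-on")
print(list(skill_kb))   # ['light-on']
```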
5. Learning Through Repetition
The agent is internally motivated to repeat the salient event to maximize its intrinsic reward. Each time the agent achieves the event, it updates:
- Its option policy (how to achieve the event)
- Its option model (predicting when the event will happen)
As the agent’s predictions improve, the intrinsic reward for that event diminishes (the agent gets "bored"). The agent then shifts focus to other unexplored or unpredictable events.
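A small sketch of this "boredom" effect, using a simple absolute prediction error as an illustrative stand-in for the option model's error; the exact reward used in the IMRL playroom experiments differs in detail.

```python
def intrinsic_reward(event_occurred, predicted_prob, eta=1.0):
    """Reward proportional to the option model's prediction error for the
    salient event: large while the event is surprising, near zero once the
    model predicts it reliably (the agent gets 'bored')."""
    return eta * abs(float(event_occurred) - predicted_prob)

# Early on, the model barely expects the event -> large reward.
print(intrinsic_reward(event_occurred=True, predicted_prob=0.05))   # 0.95
# After many repetitions the model predicts it well -> tiny reward.
print(intrinsic_reward(event_occurred=True, predicted_prob=0.97))   # ~0.03
```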
6. Hierarchical Skill Building
Many complex events (e.g., making the monkey cry out) require a sequence of simpler events (turning on the light, turning on music, ringing the bell). The agent bootstraps:
- It learns simple skills first (turn on light, ring bell).
- These skills become building blocks for more complex behaviors.
Over time, the agent develops a hierarchy of options from simple to complex.
7. Generalization and Transfer
- Once the agent has learned a repertoire of skills through intrinsic motivation, it can quickly solve extrinsically rewarded tasks that require those skills.
- For example, if the only extrinsic reward is for making the monkey cry out (a complex event), the intrinsically motivated agent solves this much faster than an agent relying solely on external rewards.
Inference
The main inference from research and practical applications is that intrinsic motivation enables RL agents to become more autonomous, adaptive and capable of open-ended learning. By rewarding themselves for exploration and skill mastery, agents can learn efficiently even when explicit goals are unclear or feedback is limited. This approach is especially valuable for real-world AI systems that must operate in complex, changing, or poorly defined environments.
Figure B: Speed of learning of various skills. Figure C: The effect of intrinsically motivated learning when extrinsic reward is present.
Challenges
- Distraction risk: Agents may fixate on unpredictable but irrelevant states (e.g., static noise on a TV screen).
- Balancing rewards: Combining intrinsic and extrinsic rewards optimally remains an open research problem.
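As a sketch of the balancing problem in the last bullet, one common heuristic is a weighted sum whose intrinsic coefficient decays over training, so exploration dominates early and the task reward dominates later; the coefficient and decay schedule below are illustrative assumptions.

```python
def combined_reward(r_ext, r_int, beta, step, decay=1e-4):
    """Weighted sum of extrinsic and intrinsic rewards with a decaying
    intrinsic coefficient; beta and decay are illustrative choices."""
    beta_t = beta / (1.0 + decay * step)
    return r_ext + beta_t * r_int

print(combined_reward(r_ext=0.0, r_int=0.5, beta=0.2, step=0))        # 0.1
print(combined_reward(r_ext=1.0, r_int=0.5, beta=0.2, step=100_000))  # ~1.009
```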