Model-Free Reinforcement Learning
Model-free Reinforcement Learning refers to methods where an agent learns directly from interaction with the environment, without constructing a predictive model of it. The agent improves its decision-making through trial and error, using observed rewards to refine its policy. Model-free RL can be divided into the following categories:
1. Value-Based Methods
Value-based methods learn a value function that estimates the expected cumulative reward for each state-action pair. The agent then selects the action with the highest estimated value. The main algorithms are:
- Q-Learning: A tabular method that updates the Q-value of the visited state-action pair toward the observed reward plus the discounted value of the best next action (see the worked example after this list):
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]
- Deep Q-Networks (DQN): DQNs use a neural network to approximate the Q-value function, allowing them to handle large state spaces such as Atari games. They use experience replay to break correlations in the training data and a target network to provide stable Q-value targets.
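To make the update rule concrete, below is a minimal worked example of a single tabular Q-update in NumPy. The state, action, reward and hyperparameter values are made up purely for illustration.
Python
import numpy as np

# Hypothetical Q-table for a toy problem with 3 states and 2 actions
Q = np.zeros((3, 2))
alpha, gamma = 0.5, 0.95          # learning rate and discount factor

# One observed transition: in state s=0 we took action a=1,
# received reward r=-1 and landed in next state s'=2
s, a, r, s_next = 0, 1, -1.0, 2

# Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q[s, a])                    # -0.5 after this first update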
2. Policy-Based Methods
Instead of learning Q-values, policy-based methods optimize a policy function directly: the agent learns a mapping from states to actions.
- These methods optimize a parameterised policy \pi_\theta(a|s) directly by gradient ascent on the expected return.
- The policy parameters are updated using the gradient of the expected return (a minimal sketch follows the formula):
\[
\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \, R \right]
\]
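The following sketch shows how this gradient can be estimated with a REINFORCE-style approach on a small discrete problem. The softmax policy over a logits table, the toy episode data and all variable names are illustrative assumptions, not part of any specific library API.
Python
import numpy as np

# Illustrative softmax policy over 4 discrete states and 2 actions,
# parameterised by a table of logits theta[s, a]
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))

def action_probs(state):
    """Return pi(a|s) as a softmax over the logits for this state."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Pretend we collected one episode of (state, action, return-to-go) tuples
episode = [(0, 1, 5.0), (2, 0, 3.0), (3, 1, 1.0)]

# REINFORCE estimator: accumulate grad log pi(a|s) * R over the episode
grad = np.zeros_like(theta)
for s, a, R in episode:
    probs = action_probs(s)
    grad_log = -probs          # d(log softmax)/d(logits) = one-hot(a) - pi(.|s)
    grad_log[a] += 1.0
    grad[s] += grad_log * R

theta += 0.01 * grad           # gradient ascent on the expected return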
3. Actor-Critic Methods (Combining Value-Based & Policy-Based)
Actor-critic methods combine the two approaches: an actor learns the policy while a critic learns a value function that evaluates the actor's actions. The critic's estimate (typically a TD error or advantage) reduces the variance of the policy-gradient update. A minimal tabular sketch is shown below.
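This sketch shows a one-step actor-critic update on a tabular problem, where the critic's TD error is used as the advantage signal for the actor. All values and names are illustrative assumptions.
Python
import numpy as np

# Actor: policy logits per state; Critic: state-value table
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))
V = np.zeros(n_states)
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.95

def action_probs(state):
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# One observed transition (values made up for illustration)
s, a, r, s_next, terminated = 0, 1, -1.0, 2, False

# Critic update: the TD error also serves as an advantage estimate
td_target = r + gamma * V[s_next] * (not terminated)
td_error = td_target - V[s]
V[s] += alpha_critic * td_error

# Actor update: policy-gradient step weighted by the TD error
probs = action_probs(s)
grad_log = -probs
grad_log[a] += 1.0               # one-hot(a) - pi(.|s)
theta[s] += alpha_actor * td_error * grad_log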
Implementation of Model Free RL
We will use OpenAI Gymnasium (formerly OpenAI Gym) to implement model-free reinforcement learning with the Q-Learning algorithm. In RL, the agent needs an environment it can interact with in order to learn optimal actions through trial and error.
To install Gymnasium use the following command:
!pip install gymnasium
1. Understand the Environment
We will use the Taxi-v3 environment from OpenAI Gymnasium, a classic RL problem in which an agent learns to navigate a grid world and transport passengers efficiently.
Python
import gymnasium as gym

# Create the Taxi-v3 environment; 'ansi' render mode returns the grid as text
env = gym.make('Taxi-v3', render_mode='ansi')
env.reset()
print(env.render())
Output:
Taxi-v3 environment (rendered grid). The grid representation displays the taxi's position, the passenger and the destination.
2. Create the Q-Learning Agent
The Q-Learning Agent follows a table-based approach where Q-values for each state-action pair are updated based on interactions with the environment.
Python
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt
class QLearningAgent:
    def __init__(self, env, learning_rate, initial_epsilon, epsilon_decay,
                 final_epsilon, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        # Initialize an empty dictionary of state-action values
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

    def get_action(self, obs) -> int:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        # The value of the next state is zero if the episode has terminated
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        temporal_difference = (reward + self.discount_factor * future_q_value
                               - self.q_values[obs][action])
        self.q_values[obs][action] += self.learning_rate * temporal_difference

    def decay_epsilon(self):
        """Decrease the exploration rate epsilon until it reaches its final value"""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
3. Define our training method
The train_agent function is responsible for training the provided Q-learning agent in a given environment over a specified number of episodes. The training steps are:
- Reset the environment at the start of each episode.
- Select an action using the agent's get_action method.
- Execute the action and observe the next state, reward and termination condition.
- Update the Q-values based on the received reward.
- Decay the exploration rate gradually.
- Track performance by recording the best average reward.
Python
def train_agent(agent, env, episodes, eval_interval=100):
    rewards = []
    best_reward = -np.inf

    for i in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        total_reward = 0

        while not (terminated or truncated):
            action = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, action, reward, terminated, next_obs)
            obs = next_obs
            total_reward += reward

        agent.decay_epsilon()
        rewards.append(total_reward)

        # Track the best average reward over the last eval_interval episodes
        if i >= eval_interval:
            avg_return = np.mean(rewards[i - eval_interval: i])
            best_reward = max(avg_return, best_reward)

        if i % eval_interval == 0 and i > 0:
            print(f"Episode {i} -> best_reward={best_reward}")

    return rewards
4. Running our training method
This step sets up the training parameters: the number of episodes, learning rate, discount factor and exploration rates. It then:
- Creates the Taxi-v3 environment from Gymnasium.
- Initializes a Q-learning agent (QLearningAgent) with the specified parameters.
- Calls the train_agent function to train the agent in that environment.
Python
episodes = 20000
learning_rate = 0.5
discount_factor = 0.95
initial_epsilon = 1.0
final_epsilon = 0.0
# Linearly decay epsilon to its final value over the first half of training
epsilon_decay = (initial_epsilon - final_epsilon) / (episodes / 2)

env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate, initial_epsilon,
                       epsilon_decay, final_epsilon)
returns = train_agent(agent, env, episodes)
Output:
The best 100-episode average reward is printed every 100 episodes as training progresses.
5. Plotting our returns
We plot episode returns over time to observe improvements in agent performance.
Python
def plot_returns(returns):
    plt.plot(np.arange(len(returns)), returns)
    plt.title('Episode returns')
    plt.xlabel('Episode')
    plt.ylabel('Return')
    plt.show()

plot_returns(returns)
Output:
Plot of episode returns. The plot shows rewards gradually increasing from negative values towards positive values, indicating improved efficiency.
6. Running our Agent
The run_agent function allows the trained agent to navigate the Taxi-v3 environment using its learned Q-values. The execution steps are:
- Set epsilon = 0 (the agent fully exploits its learned policy).
- Reset environment and render the initial state.
- Loop until termination, selecting optimal actions.
- Render updated states after each action.
Python
def run_agent(agent, env):
    agent.epsilon = 0    # pure exploitation, no exploration
    obs, _ = env.reset()
    env.render()

    terminated = truncated = False
    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        print(env.render())
        obs = next_obs

env = gym.make('Taxi-v3', render_mode='ansi')
run_agent(agent, env)
Output:
Output of the agent's actions in the Taxi-v3 grid.
Challenges of Model-Free RL
- Sample Inefficiency: Requires a large number of interactions with the environment to learn an optimal policy.
- Exploration Challenges: Without an explicit model, the agent may struggle to discover optimal strategies.
- High Variance in Policy Gradients: Policy-based methods suffer from instability and require additional techniques such as entropy regularization or baselines.
When to Use Model-Free RL
Use Model-Free RL when:
- The environment is complex and unpredictable, making model estimation difficult.
- Real-time decision-making is required.
- You have access to sufficient training data.
Avoid Model-Free RL when:
- Sample efficiency is critical (model-based RL is better).
- You have prior knowledge about the environment dynamics.
Model-Free Reinforcement Learning trains agents without requiring an explicit model of the environment. While it offers flexibility and robustness in complex environments, it often suffers from sample inefficiency and high variance during learning.