Model-Free Reinforcement Learning
Model-free Reinforcement Learning refers to methods where an agent learns directly from interaction with the environment, without constructing a predictive model of it. The agent improves its decision-making through trial and error, using observed rewards to refine its policy. Model-free RL can be divided into the following categories:
1. Value-Based Methods
Value-based methods learn a value function that estimates the expected cumulative reward for each state-action pair. The agent then selects the action with the highest estimated value. The main algorithms are:
- Q-Learning: A tabular method that updates the Q-value of the visited state-action pair toward the observed reward plus the discounted value of the best next action (see the worked example after this list):
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]
- Deep Q-Networks (DQN): DQNs use a neural network to approximate the Q-value function, allowing them to handle large state spaces such as Atari games. They use experience replay to break correlations in the training data and a target network to provide stable Q-value targets.
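To make the update rule concrete, below is a minimal worked example of a single tabular Q-update in NumPy. The state, action, reward and hyperparameter values are made up purely for illustration.
Python
import numpy as np

# Hypothetical Q-table for a toy problem with 3 states and 2 actions
Q = np.zeros((3, 2))
alpha, gamma = 0.5, 0.95          # learning rate and discount factor

# One observed transition: in state s=0 we took action a=1,
# received reward r=-1 and landed in next state s'=2
s, a, r, s_next = 0, 1, -1.0, 2

# Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q[s, a])                    # -0.5 after this first update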
2. Policy-Based Methods
Instead of learning Q-values, policy-based methods optimize a policy function directly: the agent learns a mapping from states to actions.
- These methods optimize a parameterised policy \pi_\theta(a|s) directly by gradient ascent on the expected return.
- The policy parameters are updated using the gradient of the expected return (a minimal sketch follows the formula):
\[
\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \, R \right]
\]
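The following sketch shows how this gradient can be estimated with a REINFORCE-style approach on a small discrete problem. The softmax policy over a logits table, the toy episode data and all variable names are illustrative assumptions, not part of any specific library API.
Python
import numpy as np

# Illustrative softmax policy over 4 discrete states and 2 actions,
# parameterised by a table of logits theta[s, a]
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))

def action_probs(state):
    """Return pi(a|s) as a softmax over the logits for this state."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Pretend we collected one episode of (state, action, return-to-go) tuples
episode = [(0, 1, 5.0), (2, 0, 3.0), (3, 1, 1.0)]

# REINFORCE estimator: accumulate grad log pi(a|s) * R over the episode
grad = np.zeros_like(theta)
for s, a, R in episode:
    probs = action_probs(s)
    grad_log = -probs          # d(log softmax)/d(logits) = one-hot(a) - pi(.|s)
    grad_log[a] += 1.0
    grad[s] += grad_log * R

theta += 0.01 * grad           # gradient ascent on the expected return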
3. Actor-Critic Methods (Combining Value-Based & Policy-Based)
Actor-critic methods combine the two approaches: an actor learns the policy while a critic learns a value function that evaluates the actor's actions. The critic's estimate (typically a TD error or advantage) reduces the variance of the policy-gradient update. A minimal tabular sketch is shown below.
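This sketch shows a one-step actor-critic update on a tabular problem, where the critic's TD error is used as the advantage signal for the actor. All values and names are illustrative assumptions.
Python
import numpy as np

# Actor: policy logits per state; Critic: state-value table
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))
V = np.zeros(n_states)
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.95

def action_probs(state):
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# One observed transition (values made up for illustration)
s, a, r, s_next, terminated = 0, 1, -1.0, 2, False

# Critic update: the TD error also serves as an advantage estimate
td_target = r + gamma * V[s_next] * (not terminated)
td_error = td_target - V[s]
V[s] += alpha_critic * td_error

# Actor update: policy-gradient step weighted by the TD error
probs = action_probs(s)
grad_log = -probs
grad_log[a] += 1.0               # one-hot(a) - pi(.|s)
theta[s] += alpha_actor * td_error * grad_log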
Implementation of Model Free RL
We will use OpenAI Gymnasium (formerly OpenAI Gym) to implement model-free reinforcement learning with the Q-Learning algorithm. In RL, the agent needs an environment it can interact with in order to learn optimal actions through trial and error.
To install Gymnasium use the following command:
!pip install gymnasium
1. Understand the Environment
We will use the Taxi-v3 environment from OpenAI Gymnasium, a classic RL problem in which an agent learns to navigate a grid world and transport passengers efficiently.
Python
import gymnasium as gym

# Create the Taxi-v3 environment; 'ansi' render mode returns the grid as text
env = gym.make('Taxi-v3', render_mode='ansi')
env.reset()
print(env.render())
Output:
Taxi-v3 environment (rendered grid). The grid representation displays the taxi's position, the passenger and the destination.
2. Create the Q-Learning Agent
The Q-Learning Agent follows a table-based approach where Q-values for each state-action pair are updated based on interactions with the environment.
Python
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt
class QLearningAgent:
    def __init__(self, env, learning_rate, initial_epsilon, epsilon_decay,
                 final_epsilon, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        # Initialize an empty dictionary of state-action values
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

    def get_action(self, obs) -> int:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        # The value of the next state is zero if the episode has terminated
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        temporal_difference = (reward + self.discount_factor * future_q_value
                               - self.q_values[obs][action])
        self.q_values[obs][action] += self.learning_rate * temporal_difference

    def decay_epsilon(self):
        """Decrease the exploration rate epsilon until it reaches its final value"""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
3. Define our training method
The train_agent function is responsible for training the provided Q-learning agent in a given environment over a specified number of episodes. The training steps are:
- Reset the environment at the start of each episode.
- Select an action using the agent's get_action method.
- Execute the action and observe the next state, reward and termination condition.
- Update the Q-values based on the received reward.
- Decay the exploration rate gradually.
- Track performance by recording the best average reward.
Python
def train_agent(agent, env, episodes, eval_interval=100):
    rewards = []
    best_reward = -np.inf

    for i in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        total_reward = 0

        while not (terminated or truncated):
            action = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, action, reward, terminated, next_obs)
            obs = next_obs
            total_reward += reward

        agent.decay_epsilon()
        rewards.append(total_reward)

        # Track the best average reward over the last eval_interval episodes
        if i >= eval_interval:
            avg_return = np.mean(rewards[i - eval_interval: i])
            best_reward = max(avg_return, best_reward)

        if i % eval_interval == 0 and i > 0:
            print(f"Episode {i} -> best_reward={best_reward}")

    return rewards
4. Running our training method
This step sets up the training parameters: the number of episodes, learning rate, discount factor and exploration rates. It then:
- Creates the Taxi-v3 environment from Gymnasium.
- Initializes a Q-learning agent (QLearningAgent) with the specified parameters.
- Calls the train_agent function to train the agent in that environment.
Python
episodes = 20000
learning_rate = 0.5
discount_factor = 0.95
initial_epsilon = 1.0
final_epsilon = 0.0
# Linearly decay epsilon to its final value over the first half of training
epsilon_decay = (initial_epsilon - final_epsilon) / (episodes / 2)

env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate, initial_epsilon,
                       epsilon_decay, final_epsilon)
returns = train_agent(agent, env, episodes)
Output:
The best 100-episode average reward is printed every 100 episodes as training progresses.
5. Plotting our returns
We plot episode returns over time to observe improvements in agent performance.
Python
def plot_returns(returns):
    plt.plot(np.arange(len(returns)), returns)
    plt.title('Episode returns')
    plt.xlabel('Episode')
    plt.ylabel('Return')
    plt.show()

plot_returns(returns)
Output:
Plot of episode returns. The plot shows rewards gradually increasing from negative values towards positive values, indicating improved efficiency.
6. Running our Agent
The run_agent function allows the trained agent to navigate the Taxi-v3 environment using its learned Q-values. The execution steps are:
- Set epsilon = 0 (the agent fully exploits its learned policy).
- Reset environment and render the initial state.
- Loop until termination, selecting optimal actions.
- Render updated states after each action.
Python
def run_agent(agent, env):
    agent.epsilon = 0    # pure exploitation, no exploration
    obs, _ = env.reset()
    env.render()

    terminated = truncated = False
    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        print(env.render())
        obs = next_obs

env = gym.make('Taxi-v3', render_mode='ansi')
run_agent(agent, env)
Output:
Output of the agent's actions in the Taxi-v3 grid.
Challenges of Model-Free RL
- Sample Inefficiency: Requires a large number of interactions with the environment to learn an optimal policy.
- Exploration Challenges: Without an explicit model, the agent may struggle to discover optimal strategies.
- High Variance in Policy Gradients: Policy-based methods suffer from instability and require additional techniques such as entropy regularization or baselines.
When to Use Model-Free RL
Use Model-Free RL when:
- The environment is complex and unpredictable, making model estimation difficult.
- Real-time decision-making is required.
- You have access to sufficient training data.
Avoid Model-Free RL when:
- Sample efficiency is critical (model-based RL is better).
- You have prior knowledge about the environment dynamics.
Model-Free Reinforcement Learning trains agents without requiring an explicit model of the environment. While it offers flexibility and robustness in complex environments, it often suffers from sample inefficiency and high variance during learning.