Model-Free Reinforcement Learning

Last Updated : 04 Jun, 2025

Model-free Reinforcement Learning refers to methods where an agent learns directly from interactions with the environment, without constructing a predictive model of it. The agent improves its decision-making through trial and error, using observed rewards to refine its policy. Model-free RL methods fall into the following categories:

1. Value-Based Methods

Value-based methods learn a value function which estimates the expected cumulative reward for each state-action pair. The agent selects the action with the highest estimated value. The main algorithms are:

  • Q-Learning: An off-policy algorithm that updates the Q-value of a state-action pair toward the Bellman optimality target after every transition:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

  • Deep Q-Networks (DQN): DQNs use a neural network to approximate the Q-value function, allowing them to handle large state spaces such as Atari games. They use experience replay to break correlations in the training data and a target network to provide stable Q-value targets, as sketched below.
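The following is a minimal sketch of the two DQN ingredients mentioned above: a replay buffer sampled at random and a target network used to compute stable bootstrapped targets. It assumes PyTorch and uses illustrative placeholder sizes (4 state features, 2 actions) and hyperparameters that are not part of this article.

Python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

# Illustrative (hypothetical) sizes: 4 state features, 2 discrete actions
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99


def make_net():
    # Small fully connected network mapping a state to one Q-value per action
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))


q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())     # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay: stores (state, action, reward, next_state, done) transitions.
# During interaction: replay_buffer.append((obs, action, reward, next_obs, float(terminated)))
replay_buffer = deque(maxlen=10_000)


def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)   # random sampling breaks correlations
    states, actions, rewards, next_states, dones = [
        torch.as_tensor(np.asarray(x), dtype=torch.float32) for x in zip(*batch)]

    # Q(s, a) predicted by the online network for the actions actually taken
    q_pred = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Stable target r + gamma * max_a' Q_target(s', a'), computed without gradients
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next * (1.0 - dones)

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In a full training loop, target_net would be re-synchronized with q_net every few hundred steps (target_net.load_state_dict(q_net.state_dict())) so that the bootstrapped targets change slowly.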

2. Policy-Based Methods

Instead of learning Q-values, policy-based methods optimize a policy function directly. The agent learns a policy that maps states to actions without maintaining an explicit value table.

  • These methods optimize a policy π(a|s) directly using gradient ascent on the expected return.
  • The policy parameters are updated along the gradient of the expected return (a REINFORCE-style sketch follows the formula below):

\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \, R \right]
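Below is a minimal sketch of how this formula can be turned into code using a linear softmax policy and the REINFORCE estimator, where R is the discounted return of an episode. The feature choice, dimensions and step sizes are illustrative assumptions, not part of the original article.

Python
import numpy as np

# Linear softmax policy pi_theta(a|s): one weight vector per action, raw state as features.
# Sizes and step sizes below are placeholders for illustration.
STATE_DIM, N_ACTIONS = 4, 2
theta = np.zeros((N_ACTIONS, STATE_DIM))


def policy(state):
    """Action probabilities pi_theta(.|s) from a softmax over linear scores."""
    logits = theta @ state
    exp = np.exp(logits - logits.max())      # subtract the max for numerical stability
    return exp / exp.sum()


def grad_log_pi(state, action):
    """Gradient of log pi_theta(a|s) with respect to theta for the linear softmax policy."""
    probs = policy(state)
    grad = -np.outer(probs, state)           # -pi(b|s) * s for every action b
    grad[action] += state                    # +s for the action that was actually taken
    return grad


def reinforce_update(episode, alpha=0.01, gamma=0.99):
    """REINFORCE: theta += alpha * G_t * grad log pi(a_t|s_t) for each step of an episode."""
    global theta
    G = 0.0
    for state, action, reward in reversed(episode):   # accumulate discounted returns backwards
        G = reward + gamma * G
        theta += alpha * G * grad_log_pi(state, action)


# Usage with a made-up episode of (state, action, reward) tuples:
episode = [(np.random.randn(STATE_DIM), np.random.randint(N_ACTIONS), 1.0) for _ in range(10)]
reinforce_update(episode)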

3. Actor-Critic Methods (Combining Value-Based & Policy-Based)

Actor-critic methods combine both approaches: an actor learns the policy π(a|s) while a critic learns a value function that evaluates the actor's choices. The critic's temporal-difference (TD) error serves as a lower-variance learning signal for the policy update, as in the sketch below.
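As a rough illustration (not part of the original article), the following sketch shows a tabular one-step actor-critic update: the critic maintains state values V(s), and its TD error both updates V and scales the actor's policy-gradient step. Table sizes and learning rates are placeholder assumptions.

Python
import numpy as np

# Tabular one-step actor-critic for a small discrete environment (placeholder sizes).
# In practice both the actor and the critic are usually neural networks.
N_STATES, N_ACTIONS = 16, 4
preferences = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax over action preferences
V = np.zeros(N_STATES)                          # critic: state-value estimates


def softmax_policy(state):
    exp = np.exp(preferences[state] - preferences[state].max())
    return exp / exp.sum()


def actor_critic_step(state, action, reward, next_state, terminated,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    # Critic: TD error delta = r + gamma * V(s') - V(s), with V(s') = 0 at termination
    target = reward if terminated else reward + gamma * V[next_state]
    delta = target - V[state]
    V[state] += alpha_critic * delta

    # Actor: move the preferences along delta * grad log pi(a|s)
    probs = softmax_policy(state)
    grad = -probs
    grad[action] += 1.0
    preferences[state] += alpha_actor * delta * grad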

Implementation of Model-Free RL

We will use Gymnasium (the maintained successor to OpenAI Gym) to implement Model-Free Reinforcement Learning with the Q-Learning algorithm. In RL, the agent needs an environment it can interact with in order to learn optimal actions through trial and error.

To install Gymnasium use the following command:

!pip install gymnasium

1. Understand the Environment

We will use the Taxi-v3 environment from Gymnasium, a classic RL problem designed to train an agent to navigate a grid world and transport passengers efficiently.

Python
import gymnasium as gym

# Create the Taxi-v3 environment with text (ANSI) rendering
env = gym.make('Taxi-v3', render_mode='ansi')
env.reset()

print(env.render())

Output:

Taxi-v3 environment rendered as a text grid

The grid representation displays the taxi's position, the passenger's location and the destination.

2. Create the Q-Learning Agent

The Q-Learning Agent follows a table-based approach where Q-values for each state-action pair are updated based on interactions with the environment.

Python
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt


class QLearningAgent:
    def __init__(self, env, learning_rate, initial_epsilon,
                 epsilon_decay, final_epsilon, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor

        # Exploration schedule (epsilon-greedy)
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        # Q-table: maps each observation to an array of action values
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

    def get_action(self, obs) -> int:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        # Bootstrap from the next state only if the episode has not terminated
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        td_error = (reward + self.discount_factor * future_q_value
                    - self.q_values[obs][action])
        self.q_values[obs][action] += self.learning_rate * td_error

    def decay_epsilon(self):
        """Decrease the exploration rate epsilon until it reaches its final value."""
        self.epsilon = max(self.final_epsilon,
                           self.epsilon - self.epsilon_decay)

3. Define our training method

The train_agent function is responsible for training the provided Q-learning agent in a given environment over a specified number of episodes. The training steps are:

  • Reset the environment at the start of each episode.
  • Select an action using the agent's get_action method.
  • Execute the action, observe the next state, reward and termination condition.
  • Update the Q-values based on received rewards.
  • Decay the exploration rate gradually.
  • Track performance by computing the best average reward over recent episodes.
Python
def train_agent(agent, env, episodes, eval_interval=100):
    rewards = []
    best_reward = -np.inf

    for i in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        total_reward = 0

        # Run one episode, updating Q-values after every step
        while not (terminated or truncated):
            action = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)

            agent.update(obs, action, reward, terminated, next_obs)
            obs = next_obs
            total_reward += reward

        agent.decay_epsilon()
        rewards.append(total_reward)

        # Track the best average return over the last eval_interval episodes
        if i >= eval_interval:
            avg_return = np.mean(rewards[i - eval_interval: i])
            best_reward = max(avg_return, best_reward)
        if i % eval_interval == 0 and i > 0:
            print(f"Episode {i} -> best_reward={best_reward}")

    return rewards

4. Running our training method

This step sets up the training parameters such as the number of episodes, learning rate, discount factor and exploration schedule. It then:

  • Creates the Taxi-v3 environment from Gymnasium.
  • Initializes a Q-learning agent (QLearningAgent) with the specified parameters.
  • Calls the train_agent function to train the agent in that environment.
Python
episodes = 20000
learning_rate = 0.5
discount_factor = 0.95
initial_epsilon = 1.0
final_epsilon = 0.0
# Linearly decay epsilon from its initial to its final value over the first half of training
epsilon_decay = (initial_epsilon - final_epsilon) / (episodes / 2)
env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate, initial_epsilon,
                       epsilon_decay, final_epsilon)

returns = train_agent(agent, env, episodes)

Output:

Training log showing the best average reward printed every 100 episodes

5. Plotting our returns

We plot episode returns over time to observe improvements in agent performance.

Python
def plot_returns(returns):
    plt.plot(np.arange(len(returns)), returns)
    plt.title('Episode returns')
    plt.xlabel('Episode')
    plt.ylabel('Return')
    plt.show()


plot_returns(returns)

Output:

Plot of rewards per episode

The plot shows rewards gradually increasing from negative values towards positive values, indicating that the agent's policy becomes more efficient as training progresses.

6. Running our Agent

The run_agent function allows the trained agent to navigate the Taxi-v3 environment using its learned Q-values. The execution steps are:

  1. Set epsilon = 0 (the agent fully exploits its learned policy).
  2. Reset environment and render the initial state.
  3. Loop until termination, selecting optimal actions.
  4. Render updated states after each action.
Python
def run_agent(agent, env):
    # Disable exploration so the agent always picks the greedy action
    agent.epsilon = 0
    obs, _ = env.reset()
    env.render()
    terminated = truncated = False

    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        print(env.render())

        obs = next_obs


env = gym.make('Taxi-v3', render_mode='ansi')
run_agent(agent, env)

Output:

Rendered output of the agent's actions

Challenges of Model-Free RL

  • Sample Inefficiency: Requires a large number of interactions with the environment to learn an optimal policy.
  • Exploration Challenges: Without an explicit model the agent may struggle to discover optimal strategies.
  • High Variance in Policy Gradients: Policy-based methods suffer from instability and require additional techniques such as baselines or entropy regularization.

When to Use Model-Free RL

Use Model-Free RL when:

  • The environment is complex or unpredictable, making model estimation difficult.
  • Real-time decision-making is required.
  • You have access to sufficient training data.

Avoid Model-Free RL when:

  • Sample efficiency is critical (model-based RL is better).
  • You have prior knowledge about the environment dynamics.

Model-Free Reinforcement Learning trains agents without requiring an explicit environment model. While it offers flexibility and robustness in complex environments, it often suffers from sample inefficiency and high variance during learning.


