SARSA (State-Action-Reward-State-Action) in Reinforcement Learning

Last Updated : 12 Jul, 2025

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm for estimating action values (Q-values). The agent improves its policy while continuously interacting with the environment, working toward an optimal policy based on its own experience. The name reflects the key components of each learning step:

State (S): The current state of the environment.
Action (A): The action taken by the agent in a given state.
Reward (R): The immediate reward received after taking action A in state S.
Next State (S'): The state resulting from the action A in state S.
Next Action (A'): The action chosen in the next state S' according to the current policy.

The SARSA algorithm updates the Q-value for a given state-action pair based on the reward and the next state-action pair, which in turn leads to an updated policy.

SARSA Update Rule

The core idea of SARSA is to update the Q-value for each state-action pair based on actual experience, i.e., what the agent does while following its policy. The Q-value is updated using the following rule, which is derived from the Bellman equation:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

Where:

Q(s_t, a_t) is the current Q-value for the state-action pair at time step t.
α is the learning rate (a value between 0 and 1) which determines how much the Q-values are updated.
r_{t+1} is the reward received after taking action a_t in state s_t.
γ is the discount factor (between 0 and 1) which determines the importance of future rewards.
Q(s_{t+1}, a_{t+1}) is the Q-value for the next state-action pair.

In this update rule:

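The agent updates its Q-value using both the immediate reward r_{t+1} and the discounted estimate of future reward, γ Q(s_{t+1}, a_{t+1}). Together these form the update target.
The difference between this target and the current estimate Q(s_t, a_t) is the temporal-difference (TD) error. A fraction α of this error is added to the current estimate to correct it.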
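
To make the arithmetic concrete, here is a minimal sketch of a single update. The numbers are made up for illustration and do not come from any particular environment.

```python
# One SARSA update with made-up numbers (all values are illustrative assumptions).
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor

q_current = 2.0  # Q(s_t, a_t) before the update
reward = -1.0    # r_{t+1}
q_next = 3.0     # Q(s_{t+1}, a_{t+1}) for the action the policy actually chose

td_target = reward + gamma * q_next         # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
td_error = td_target - q_current            # gap between target and current estimate
q_updated = q_current + alpha * td_error    # move a fraction alpha toward the target

print(f"TD target: {td_target:.3f}")
print(f"TD error:  {td_error:.3f}")
print(f"New Q:     {q_updated:.3f}")
```

With these values the target is 1.97, the TD error is -0.03, and the estimate moves from 2.0 to 1.997.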

SARSA Algorithm

Here are the steps of the SARSA algorithm:

1. Initialize:

Initialize Q-values arbitrarily for each state-action pair.
Choose an initial state s_0.

2. Loop for each episode:

Set the initial state s_t and choose an action a_t based on a policy (e.g., ε-greedy).
Repeat for each step in the episode:
- Take action a_t, observe reward r_{t+1} and transition to state s_{t+1}.
- Choose the next action a_{t+1} from state s_{t+1} based on the policy.
- Update the Q-value for the state-action pair (s_t, a_t) using the SARSA update rule.
- Set s_t = s_{t+1} and a_t = a_{t+1}.
Continue until the episode ends (when the agent reaches a terminal state or after a fixed number of steps).
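
The steps above can be sketched on a very small problem. The corridor environment, the helper names (`step`, `choose_action`, `run_sarsa`) and the `max_steps` cap below are assumptions made for illustration. The cap shows one way to end an episode "after a fixed number of steps".

```python
import random

# Hypothetical toy environment: a corridor of cells 0..4, start at 0, goal at 4.
N_CELLS, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Move one cell, clamped to the corridor; -1 per step, +10 at the goal."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, N_CELLS - 1)
    done = next_state == GOAL
    return next_state, (10 if done else -1), done

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy choice over the two actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_sarsa(episodes=200, alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialize Q-values (here: all zeros)
    Q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        # 2. Start the episode and choose the first action
        state = 0
        action = choose_action(Q, state, epsilon, rng)
        for _ in range(max_steps):  # fixed step cap per episode
            next_state, reward, done = step(state, action)
            next_action = choose_action(Q, next_state, epsilon, rng)
            # A terminal state contributes no future value
            next_q = 0.0 if done else Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (reward + gamma * next_q - Q[(state, action)])
            state, action = next_state, next_action
            if done:
                break
    return Q

Q = run_sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)  # greedy action index for cells 0..3
```

With settings like these, the greedy action in each cell will typically be 1 (right), since that is the shortest route to the goal.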

SARSA Implementation in Grid World

Step 1: Define the Environment (GridWorld)

The environment is a grid world where the agent can move up, down, left or right. The environment includes:

A start position.
A goal position.
Obstacles that the agent should avoid.
A reward structure: -10 for hitting an obstacle, +10 for reaching the goal and -1 for every other step. Hitting an obstacle or reaching the goal ends the episode.

The GridWorld class handles the environment dynamics, including resetting the state and taking actions.

Python

import numpy as np
import random

class GridWorld:
    def __init__(self, width, height, start, goal, obstacles):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

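    def step(self, action):
        # State is (row, column): x is the row index (0..height-1),
        # y is the column index (0..width-1). Moves are clamped to the grid.
        x, y = self.state
        if action == 0:  # Up
            x = max(x - 1, 0)
        elif action == 1:  # Down
            x = min(x + 1, self.height - 1)
        elif action == 2:  # Left
            y = max(y - 1, 0)
        elif action == 3:  # Right
            y = min(y + 1, self.width - 1)

        next_state = (x, y)

        if next_state in self.obstacles:
            # Hitting an obstacle is penalized and ends the episode
            reward = -10
            done = True
        elif next_state == self.goal:
            # Reaching the goal is rewarded and ends the episode
            reward = 10
            done = True
        else:
            # Every other step costs -1, which favors shorter paths
            reward = -1
            done = False

        self.state = next_state
        return next_state, reward, done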

Step 2: Define the SARSA Algorithm

The SARSA algorithm iterates over episodes, updating the Q-values based on the agent's experience.

Python

def sarsa(env, episodes, alpha, gamma, epsilon):
    # Q-table with one entry per (row, column, action), initialized to zero
    Q = np.zeros((env.height, env.width, 4))

    for episode in range(episodes):
        state = env.reset()
        # epsilon_greedy_policy is defined in Step 3
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            # SARSA update rule. Terminal cells (goal, obstacles) are never the
            # state being updated, so their Q-values stay 0 and add no future value.
            td_target = reward + gamma * Q[next_state[0], next_state[1], next_action]
            td_error = td_target - Q[state[0], state[1], action]
            Q[state[0], state[1], action] += alpha * td_error

            state = next_state
            action = next_action

    return Q

Step 3: Define the Epsilon-Greedy Policy

The epsilon-greedy policy balances exploration and exploitation:

With probability ϵ the agent chooses a random action (exploration).
With probability 1−ϵ the agent chooses the action with the highest Q-value for the current state (exploitation).

Python

def epsilon_greedy_policy(Q, state, epsilon):
    if random.random() < epsilon:
        return random.randint(0, 3)  # Random action (randint includes both endpoints)
    else:
        # Greedy action; np.argmax returns the lowest index when values tie
        return int(np.argmax(Q[state[0], state[1]]))
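
One detail is easy to miss: the random branch can also land on the greedy action, so with 4 actions the greedy action is actually selected with probability 1 − ϵ + ϵ/4. The sketch below checks this empirically. The helper `epsilon_greedy`, the Q-values and the seed are illustrative assumptions, separate from the function above.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(42)           # seeded so the run is repeatable
q_values = [0.2, 1.5, -0.3, 0.9]  # made-up Q-values; action 1 is greedy
epsilon = 0.1
trials = 100_000

greedy_picks = sum(epsilon_greedy(q_values, epsilon, rng) == 1 for _ in range(trials))
empirical = greedy_picks / trials
expected = 1 - epsilon + epsilon / len(q_values)  # 0.925 for 4 actions

print(f"expected {expected:.3f}, observed {empirical:.3f}")
```

The observed frequency should land close to 0.925, not 0.9.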

Step 4: Set Up the Environment and Run SARSA

This step involves:

Defining the grid world parameters (width, height, start, goal, obstacles).
Setting the SARSA hyperparameters (episodes, learning rate, discount factor, exploration rate).
Running the SARSA algorithm and printing the learned Q-values.

Python

if __name__ == "__main__":
    # Define the grid world environment
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]
    env = GridWorld(width, height, start, goal, obstacles)

    # SARSA parameters
    episodes = 1000
    alpha = 0.1  # Learning rate
    gamma = 0.99  # Discount factor
    epsilon = 0.1  # Exploration rate

    # Run SARSA
    Q = sarsa(env, episodes, alpha, gamma, epsilon)

    # Print the learned Q-values
    print("Learned Q-values:")
    print(Q)

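Output:

Running the script prints the learned Q-table, a NumPy array of shape (5, 5, 4) holding one value per grid cell and action. The exact numbers differ from run to run because exploration is random. Each Q-value estimates the expected cumulative reward of taking that action in that state and following the policy afterwards. The agent uses these Q-values to make decisions in the environment: for a given state, a higher Q-value indicates a better action.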
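
A raw Q-table is hard to read, so it is often reduced to the greedy action per cell. The sketch below shows one way to do that with `np.argmax`. The Q-table here is a random stand-in of the same shape, not the result of a training run, and the arrow rendering is just one possible presentation.

```python
import numpy as np

# Stand-in Q-table of shape (height, width, n_actions). The values are random
# placeholders, not the output of a real SARSA run.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5, 4))

# Same action order as the article's code: 0 = up, 1 = down, 2 = left, 3 = right
ARROWS = np.array(["↑", "↓", "←", "→"])

greedy_actions = np.argmax(Q, axis=2)  # index of the best action in each cell
state_values = np.max(Q, axis=2)       # Q-value of that best action

# Print the greedy policy as a grid of arrows, one row per grid row
for row in ARROWS[greedy_actions]:
    print(" ".join(row))
```

Applied to a trained Q-table, the same two calls give the learned greedy policy and its estimated value per cell.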

Exploration Strategies in SARSA

SARSA typically uses an exploration-exploitation strategy to choose actions. The most common strategy is ε-greedy where:

Exploitation: With probability 1 − ε, the agent chooses the action with the highest Q-value.
Exploration: With probability ε, the agent chooses a random action to explore new possibilities.

Over time ε is often decayed (reduced) to shift from exploration to exploitation as the agent gains more experience in the environment.
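
One common way to decay the exploration rate is an exponential schedule with a floor. The function name `decayed_epsilon` and its default values are assumptions chosen for illustration. Other schedules, such as linear decay, are equally valid.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Exponential decay with a lower bound; all defaults are illustrative."""
    return max(eps_min, eps_start * decay ** episode)

for episode in (0, 100, 500):
    print(episode, round(decayed_epsilon(episode), 3))
```

With these defaults the rate starts at 1.0, falls to about 0.366 after 100 episodes, and has reached the floor of 0.05 by episode 500. The floor keeps a small amount of exploration going throughout training.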

Strengths of SARSA

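Stability with Exploration: SARSA's update uses the action the agent actually takes next, including exploratory ones, so its value estimates account for the cost of exploring. This tends to give more cautious behavior than Q-learning in environments where an exploratory mistake is expensive.
On-Policy Learning: Since SARSA learns from the policy the agent is actually following, the values it learns describe that exploring policy. This suits situations where the agent's performance during learning matters.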

Weaknesses of SARSA

Slower Convergence: Because SARSA evaluates the exploring policy, not the greedy one, it might take longer to converge to an optimal policy than off-policy methods like Q-learning, especially when there is a lot of exploration.
Sensitive to the Exploration Strategy: The agent's performance depends on how exploration is managed, for example how ε is decayed in the ε-greedy strategy.
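
The on-policy versus off-policy distinction mentioned above comes down to one term in the update target. The sketch below compares the two targets for the same transition, using made-up numbers. The Q-learning target shown is the standard max-based one and is included only for contrast.

```python
import numpy as np

gamma = 0.99
reward = -1.0
q_next = np.array([1.0, 4.0, 2.0, 0.5])  # made-up Q(s', a) for the four actions
next_action = 2                          # action the policy actually chose (exploratory here)

# SARSA (on-policy): bootstrap from the action that will really be taken
sarsa_target = reward + gamma * q_next[next_action]

# Q-learning (off-policy): bootstrap from the best action, whatever was chosen
q_learning_target = reward + gamma * q_next.max()

print(f"SARSA target:      {sarsa_target:.2f}")
print(f"Q-learning target: {q_learning_target:.2f}")
```

Here the SARSA target is 0.98 and the Q-learning target is 2.96. The two only differ on steps where the chosen action is not the greedy one, which is why the amount of exploration affects what SARSA learns.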

Understanding SARSA is essential for building RL agents that can effectively learn from their interactions with the environment, especially when an on-policy approach is required.

AlindGupta
