SARSA (State-Action-Reward-State-Action) in Reinforcement Learning

Last Updated: 12 Jul, 2025

SARSA (State-Action-Reward-State-Action) is an on-policy learning algorithm used in reinforcement learning (RL). It helps an agent learn an optimal policy from experience: the agent improves its policy while continuously interacting with the environment.

The key components of the RL process, which together give SARSA its name, are:

- State (S): The current state of the environment.
- Action (A): The action taken by the agent in state S.
- Reward (R): The immediate reward received after taking action A in state S.
- Next State (S'): The state that results from taking action A in state S.
- Next Action (A'): The action chosen in the next state S' according to the current policy.

SARSA updates the Q-value of a state-action pair using the reward and the next state-action pair, which in turn leads to an updated policy.

SARSA Update Rule

The core idea of SARSA is to update the Q-value of each state-action pair based on actual experience, i.e. what the agent does while following its policy. The Q-value is updated with the following rule, which is based on the Bellman equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

Where:

- $Q(s_t, a_t)$ is the current Q-value for the state-action pair at time step t.
- α is the learning rate (a value between 0 and 1), which determines how much the Q-values change on each update.
- $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
- γ is the discount factor (a value between 0 and 1), which determines the importance of future rewards.
- $Q(s_{t+1}, a_{t+1})$ is the Q-value of the next state-action pair, where $a_{t+1}$ is the action the current policy actually selects in $s_{t+1}$.

In this update rule:

- The agent updates its Q-value using both the immediate reward $r_{t+1}$ and the estimated future reward $\gamma Q(s_{t+1}, a_{t+1})$.
- The difference between the target $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$ and the current estimate $Q(s_t, a_t)$ (the temporal-difference error) is scaled by α and used to correct the current estimate.
- If $s_{t+1}$ is a terminal state, there is no next action and the target is simply $r_{t+1}$.

SARSA Algorithm

Here are the steps of the SARSA algorithm:

1. Initialize:
   - Initialize the Q-values arbitrarily for each state-action pair (Q-values of terminal states are set to 0).
   - Choose an initial state $s_0$.
2. Loop for each episode:
   - Set the initial state $s_t$ and choose an action $a_t$ using a policy derived from Q (e.g., ε-greedy).
   - Repeat for each step in the episode:
     - Take action $a_t$, observe the reward $r_{t+1}$ and transition to state $s_{t+1}$.
     - Choose the next action $a_{t+1}$ from state $s_{t+1}$ using the same policy.
     - Update the Q-value of the pair $(s_t, a_t)$ using the SARSA update rule.
     - Set $s_t = s_{t+1}$ and $a_t = a_{t+1}$.
   - Continue until the episode ends (when the agent reaches a terminal state or after a fixed number of steps).

SARSA Implementation in Grid World

Step 1: Define the Environment (GridWorld)

The environment is a grid world in which the agent can move up, down, left or right. The environment includes:

- A start position.
- A goal position.
- Obstacles that the agent should avoid.
- A reward structure: a negative reward (-10) for hitting an obstacle, a positive reward (+10) for reaching the goal and a small negative reward (-1) for every other step.

The GridWorld class handles the environment dynamics, including resetting the state and taking actions. States are (row, column) tuples, and actions are encoded as 0 = Up, 1 = Down, 2 = Left, 3 = Right.

```python
import numpy as np
import random


class GridWorld:
    def __init__(self, width, height, start, goal, obstacles):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        # Put the agent back at the start position
        self.state = self.start
        return self.state

    def step(self, action):
        # State is (row, col); moves off the grid leave the agent at the border
        x, y = self.state
        if action == 0:    # Up
            x = max(x - 1, 0)
        elif action == 1:  # Down
            x = min(x + 1, self.height - 1)
        elif action == 2:  # Left
            y = max(y - 1, 0)
        elif action == 3:  # Right
            y = min(y + 1, self.width - 1)
        next_state = (x, y)

        if next_state in self.obstacles:
            reward = -10   # Hitting an obstacle ends the episode
            done = True
        elif next_state == self.goal:
            reward = 10    # Reaching the goal ends the episode
            done = True
        else:
            reward = -1    # Step cost encourages shorter paths
            done = False

        self.state = next_state
        return next_state, reward, done
```

Step 2: Define the SARSA Algorithm

The SARSA algorithm iterates over episodes, updating the Q-values based on the agent's experience.
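Before the full training loop, it can help to see the update rule applied once in isolation. The sketch below uses a hypothetical helper `sarsa_update` (not part of the article's code) and made-up numbers: Q(s, a) = 2.0, r = -1, Q(s', a') = 5.0, α = 0.1 and γ = 0.9. The target is -1 + 0.9 × 5.0 = 3.5, the TD error is 3.5 - 2.0 = 1.5, and Q(s, a) moves by 0.1 × 1.5 to about 2.15.

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma, terminal=False):
    """One SARSA update: move Q(s, a) toward r + gamma * Q(s', a').

    Hypothetical helper for illustration only.
    """
    # A terminal next state has no future value, so the target is just the reward
    target = reward if terminal else reward + gamma * q_next
    td_error = target - q_sa
    return q_sa + alpha * td_error


# Illustrative values, not taken from the grid world run
new_q = sarsa_update(q_sa=2.0, reward=-1.0, q_next=5.0, alpha=0.1, gamma=0.9)
print(new_q)  # approximately 2.15

# Terminal transition: the target is the reward alone (10), so Q moves from 2.0 toward 10
print(sarsa_update(q_sa=2.0, reward=10.0, q_next=0.0, alpha=0.1, gamma=0.9, terminal=True))
```

The `sarsa` function below performs this same update at every step of every episode, reading Q(s, a) and Q(s', a') from the Q-table.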
```python
def sarsa(env, episodes, alpha, gamma, epsilon):
    # Initialize the Q-table with zeros: one value per (row, col, action)
    Q = np.zeros((env.height, env.width, 4))

    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            # SARSA update rule; a terminal next state contributes no future value
            future = 0.0 if done else Q[next_state[0], next_state[1], next_action]
            td_target = reward + gamma * future
            Q[state[0], state[1], action] += alpha * (td_target - Q[state[0], state[1], action])

            state = next_state
            action = next_action

    return Q
```

Step 3: Define the Epsilon-Greedy Policy

The epsilon-greedy policy balances exploration and exploitation:

- With probability ε the agent chooses a random action (exploration).
- With probability 1 − ε the agent chooses the action with the highest Q-value for the current state (exploitation).

```python
def epsilon_greedy_policy(Q, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)  # Random action (randint includes both ends: 0..3)
    else:
        return int(np.argmax(Q[state[0], state[1]]))  # Greedy action
```

Step 4: Set Up the Environment and Run SARSA

This step involves:

- Defining the grid world parameters (width, height, start, goal, obstacles).
- Setting the SARSA hyperparameters (episodes, learning rate, discount factor, exploration rate).
- Running the SARSA algorithm and printing the learned Q-values.

```python
if __name__ == "__main__":
    # Define the grid world environment
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]

    env = GridWorld(width, height, start, goal, obstacles)

    # SARSA parameters
    episodes = 1000
    alpha = 0.1    # Learning rate
    gamma = 0.99   # Discount factor
    epsilon = 0.1  # Exploration rate

    # Run SARSA
    Q = sarsa(env, episodes, alpha, gamma, epsilon)

    # Print the learned Q-values (shape: height x width x 4 actions)
    print("Learned Q-values:")
    print(Q)
```

After running the SARSA algorithm, each Q-value estimates the expected discounted return of taking that action in that state and then following the ε-greedy policy. The agent uses these Q-values to make decisions in the environment.
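Because the agent acts on these Q-values, a natural follow-up is to read off the greedy action in every cell. The sketch below assumes a Q-table shaped (height, width, 4), like the one `sarsa` returns, and the article's action encoding (0 = Up, 1 = Down, 2 = Left, 3 = Right); the helper names `greedy_policy` and `policy_arrows` and the arrow symbols are illustrative choices, not part of the article's code. To stay self-contained it runs on a tiny hand-made Q-table rather than learned values.

```python
import numpy as np

# Arrow for each action index, following the article's GridWorld encoding
ARROWS = np.array(["^", "v", "<", ">"])


def greedy_policy(Q):
    """Greedy action index for every (row, col) cell of a (height, width, 4) Q-table."""
    # np.argmax breaks ties by taking the first maximal action
    return np.argmax(Q, axis=2)


def policy_arrows(Q, terminal_cells=()):
    """Render the greedy policy as a grid of arrows; terminal cells are marked 'T'."""
    grid = ARROWS[greedy_policy(Q)]  # fancy indexing returns a new array
    for (r, c) in terminal_cells:
        grid[r, c] = "T"
    return grid


# Tiny hand-made 2x2 Q-table (illustrative values, not learned ones)
Q = np.zeros((2, 2, 4))
Q[0, 0, 3] = 1.0  # at (0, 0) Right looks best
Q[0, 1, 1] = 1.0  # at (0, 1) Down looks best
Q[1, 0, 3] = 1.0  # at (1, 0) Right looks best

for row in policy_arrows(Q, terminal_cells=[(1, 1)]):
    print(" ".join(row))
```

With the variables from Step 4, a call such as `policy_arrows(Q, terminal_cells=[goal] + obstacles)` should display the learned policy for the 5×5 grid; cells the agent rarely visits may still show the default action because their Q-values stay near zero.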
Higher Q-values indicate better actions for a given state.

Exploration Strategies in SARSA

SARSA needs an exploration strategy to choose its actions. The most common one is ε-greedy, where:

- Exploitation: With probability 1 − ε, the agent chooses the action with the highest Q-value.
- Exploration: With probability ε, the agent chooses a random action to explore new possibilities.

Over time ε is often decayed (reduced) to shift from exploration to exploitation as the agent gains more experience in the environment.

Strengths of SARSA

- Stability with Exploration: SARSA's Q-values account for the exploratory actions the agent actually takes, so when exploration is risky it tends to learn more conservative behavior than off-policy methods. The classic example is the cliff-walking task, where SARSA learns a path that keeps away from the cliff edge while Q-learning learns the shorter path along it.
- On-Policy Learning: Since SARSA learns the value of the policy the agent is actually following, it is well suited to situations where the agent must learn from its own exploration, for example when mistakes made during training are costly.

Weaknesses of SARSA

- Slower Convergence: Because SARSA is on-policy, it may take longer to converge than off-policy methods such as Q-learning, especially when there is a lot of exploration. With a fixed ε it learns the best ε-greedy policy rather than the optimal greedy one, which is one reason ε is usually decayed.
- Sensitive to the Exploration Strategy: The agent's performance depends on how exploration is managed, for example how ε is decayed in the ε-greedy strategy.

Understanding SARSA is essential for building RL agents that learn effectively from their interactions with the environment, especially when an on-policy approach is required.

Author: AlindGupta
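The ε decay mentioned under Exploration Strategies in SARSA can be sketched as a simple per-episode schedule. This is a minimal sketch assuming exponential decay with a floor; the function name `decayed_epsilon` and the values eps_start = 1.0, eps_min = 0.05 and decay_rate = 0.995 are illustrative assumptions, and linear decay is an equally common alternative.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay_rate=0.995):
    """Exponentially decay epsilon per episode, never going below eps_min.

    Hypothetical schedule; the default parameters are illustrative assumptions.
    """
    # The floor keeps some exploration so every action is still tried occasionally
    return max(eps_min, eps_start * decay_rate ** episode)


# Show the schedule at a few checkpoints of a 1000-episode run
for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```

In the article's `sarsa` function, one way to use this would be to set `epsilon = decayed_epsilon(episode)` at the top of the episode loop instead of keeping a fixed value. Because SARSA updates with the action it actually takes, its Q-values then follow the policy as it becomes increasingly greedy.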