
Univariate Time Series Analysis and Forecasting

Last Updated : 02 Jul, 2025

Working with time series data is one of the most challenging tasks in machine learning and in real-world data problems, because each observation depends not only on external factors but, above all, on the order in which the observations occurred. A target value in a time series can be forecast from a single variable (univariate), from two variables (bivariate) or from several variables (multivariate).

In this article, we will learn how to perform univariate forecasts on the Rainfall dataset that has been taken from Kaggle.

Univariate Forecasting

Univariate forecasting is used when you want to make predictions for a single variable, especially when historical data points are available for that variable. It is a widely applicable technique in fields like economics, finance, weather forecasting and demand forecasting in supply chain management.

For more complex forecasting tasks where multiple variables or external factors may have an impact, multivariate forecasting techniques are used. These models take into account multiple variables and their interactions for making predictions.

Key Concepts of Univariate Forecasting

  • Trend: The trend is the long-term movement or direction of a time series. It shows whether values rise or fall over time. Determining the trend is essential for analyzing the variable's overall trajectory and producing precise forecasts.
  • Seasonality: Seasonality describes periodic patterns that appear at regular intervals. For instance, seasonal patterns are frequently seen in retail sales because of weather-related factors or holidays. Taking seasonality into account is important for identifying trends and adjusting forecasts appropriately.
  • Stationarity: A time series is considered stationary when its statistical characteristics (such as mean and variance) don't change over time. Since non-stationary data can produce inaccurate predictions, stationarity is a crucial assumption in many forecasting models.
  • Time Series Data: Time series data is a series of observations taken over time at regular intervals. Sales numbers, temperature readings, GDP growth rates and stock prices are a few examples.
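
To make these concepts concrete, the following is a minimal sketch that builds a synthetic monthly series from a trend, a seasonal pattern and noise. The series, its coefficients and all variable names are hypothetical and are not taken from the Rainfall dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: 10 years of month-start timestamps.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-01", periods=120, freq="MS")
t = np.arange(120)

trend = 0.05 * t                                 # slow upward drift
seasonality = 2.0 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 months
noise = rng.normal(scale=0.3, size=120)          # irregular fluctuations

series = pd.Series(10 + trend + seasonality + noise, index=dates, name="value")

# The yearly means rise because of the trend, so the mean is not constant
# over time and the series is not stationary.
yearly_mean = series.groupby(series.index.year).mean()
print(yearly_mean.round(2))
```

Averaging over whole years cancels the seasonal pattern, so the rising yearly means reflect the trend alone.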

Techniques of Univariate Forecasting

Several methods are frequently used in univariate time series analysis to model and predict the behavior of a single variable over time:

  • Autoregression (AR): It makes use of the correlation between an observation and a predetermined number of lagged observations (earlier time steps).
  • Moving Average (MA): It models the relationship between an observation and the residual errors from previous time steps.
  • Autoregressive Integrated Moving Average (ARIMA): It combines the AR and MA methods with differencing of the raw observations. The differencing step makes the time series stationary, i.e. gives it a consistent mean and variance over time.
  • Seasonal Autoregressive Integrated Moving Average (SARIMA): It extends ARIMA to take the seasonal component of the time series into account.
  • Exponential Smoothing (ETS): It uses a weighted average of historical observations to forecast the next time point, giving more weight to recent observations.
  • Long Short-Term Memory (LSTM): A kind of recurrent neural network (RNN) that is specifically designed to identify patterns in time series data over extended periods of time.
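
As an illustration of the "more weight to recent observations" idea, here is a minimal sketch of simple exponential smoothing written by hand. The function name, the initialisation with the first observation and the sample values are assumptions made for this example; library implementations offer more options.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    level = values[0]          # assumption: initialise with the first observation
    levels = [level]
    for y in values[1:]:
        # New level = weighted average of the latest observation and the old level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels


observations = [10.0, 12.0, 11.0, 13.0]
levels = simple_exponential_smoothing(observations, alpha=0.5)
print(levels)  # [10.0, 11.0, 11.0, 12.0]
```

A larger alpha makes the forecast react faster to recent values, while a smaller alpha gives a smoother result.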

Implementation of Univariate Forecasting

1. Importing Libraries

Here we will import Pandas, NumPy, Matplotlib, Seaborn and statsmodels.

Python
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pandas.plotting import lag_plot
from statsmodels.tsa.stattools import adfuller

import warnings
warnings.filterwarnings('ignore')

2. Loading Dataset

Here we load the dataset, which was taken from Kaggle, into a pandas DataFrame so that we can explore its different aspects.

Python
df = pd.read_csv('Rainfall_data.csv')
df.head()

Output:

Dataset Head

3. Shape of the dataframe

This code returns the number of rows and columns of the DataFrame df.

Python
df.shape

Output:

(252, 7)

4. Data Information

By using the df.info() function we can see the content of each column and the data types present in it along with the number of null values present in each column.

Python
df.info()

Output:

Dataframe Information

5. Describing the data

The df.describe() function gives a statistical description of the DataFrame df. It includes important statistics such as count, mean, standard deviation, minimum and maximum values for each numerical column.

Python
print(df.describe().T)

Output:

Column description

Exploratory Data Analysis

EDA is an approach to analyzing the data using visual techniques. It is used to discover patterns or to check assumptions with the help of statistical summaries and graphical representations. While performing the EDA of this dataset we try to look at the relation between the independent features and how one affects the other.

Python
# Feature engineering
df['Date'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-' + df['Day'].astype(str),
                            format='%Y-%m-%d')
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Year'] = df['Date'].dt.year

# EDA
cat_cols = ['DayOfWeek', 'Month', 'Quarter', 'Year']

for col in cat_cols:
    df[['Precipitation', col]].groupby(col).mean().plot.bar()
    plt.title(f'Mean Precipitation by {col}')
    plt.show()

Output:

Precipitation by Days of the week

Interpretation:

  • The x-axis represents each day of the week (0 to 6 where 0 is Monday and 6 is Sunday).
  • The y-axis represents the mean precipitation for each respective day of the week.
  • Comparing the bar heights shows which days of the week tend to have higher or lower mean precipitation. Here, mean precipitation is higher on the 5th day of the week than on the other days.

Precipitation by months

Interpretation:

  • The x-axis represents each month of the year (1 to 12).
  • The y-axis represents the mean precipitation for each respective month.
  • Comparing the bars reveals how precipitation varies across months. Here, mean precipitation is highest in the 7th month (July).

Precipitation by Quarters


Interpretation:

  • The x-axis represents each quarter of the year (1 to 4).
  • The y-axis represents the mean precipitation for each respective quarter.
  • Comparing the quarterly averages reveals seasonal variation. Here, mean precipitation is higher in the 3rd quarter than in the other three quarters (1st, 2nd and 4th).

Precipitation by Years

Interpretation:

  • The x-axis represents each year.
  • The y-axis represents the mean precipitation for each respective year.
  • Comparing the bars reveals trends or changes in mean precipitation over the years. Here, mean precipitation is higher in 2019 than in the other years.

This code performs feature engineering on the rainfall dataset. It builds a 'Date' column from the 'Year', 'Month' and 'Day' columns and then derives temporal features such as 'DayOfWeek', 'Month', 'Quarter' and 'Year' from it.

Exploratory data analysis (EDA) is then carried out with a focus on the mean precipitation for every unique value of the new temporal features. To evaluate the average precipitation patterns over the course of a week, month, quarter and year, the code iterates through the categorical columns and creates bar charts. By showcasing temporal trends and patterns in the rainfall data, these visualizations help in understanding the features of the dataset.

Seasonal Decomposition

A statistical technique used in time series analysis to separate the constituent parts of a dataset is called seasonal decomposition. Three fundamental components of the time series are identified: trend, seasonality and residuals.

  • The long-term movement or direction is represented by the trend.
  • Repeating patterns at regular intervals are captured by seasonality.
  • Random fluctuations are captured by residuals.

By separating the effects of seasonality from broader trends, decomposing a time series helps to comprehend the contributions of various components. This enables more accurate analysis and predictions.

Python
# Seasonal decomposition
# Add a small constant: the multiplicative model requires strictly positive values
ts = df.set_index('Date')['Precipitation'] + 0.01
result = seasonal_decompose(ts, model='multiplicative', period=12)
result.plot()
plt.show()

Output:

Seasonal Decomposition

result.plot() draws four panels, from top to bottom: the observed series, the trend, the seasonal component and the residuals. All panels share the same time axis.

For Trend Component:

  • The second panel represents the trend component.
  • The x-axis corresponds to the time, reflecting the overall trend across the entire time series.
  • The y-axis represents the magnitude of the trend.

For Seasonal Component:

  • The third panel represents the seasonal component.
  • The x-axis corresponds to the time. With period=12 on monthly data, the pattern repeats every 12 months.
  • The y-axis represents the magnitude of the seasonal variations. In a multiplicative model these are factors that vary around 1.

For Residual Component:

  • The bottom panel represents the residual component (also known as the remainder).
  • The x-axis corresponds to the time.
  • The y-axis represents what remains of the observed values once the trend and seasonal components are removed. Because the model here is multiplicative, this is the observed value divided by the product of the trend and seasonal components, not a difference.
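
The idea behind a multiplicative decomposition can be sketched by hand with pandas. This is a simplified illustration on a hypothetical synthetic series, using a centred moving average for the trend and monthly averages for the seasonal factors. It is not claimed to reproduce the exact algorithm of seasonal_decompose.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly positive monthly series with multiplicative structure.
dates = pd.date_range("2015-01-01", periods=72, freq="MS")
t = np.arange(72)
observed = pd.Series((50 + 0.5 * t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12)),
                     index=dates)

# 1. Trend: centred 2x12 moving average (12-month mean, then a 2-point mean,
#    shifted back by 6 steps so each value sits in the middle of its window).
trend = observed.rolling(12).mean().rolling(2).mean().shift(-6)

# 2. Seasonality: divide out the trend, then average each calendar month.
detrended = observed / trend
seasonal_index = detrended.groupby(detrended.index.month).mean()
seasonal_index = seasonal_index / seasonal_index.mean()   # normalise to mean 1
seasonal = pd.Series(seasonal_index.loc[observed.index.month].to_numpy(),
                     index=dates)

# 3. Residual: what is left after dividing out trend and seasonality.
residual = observed / (trend * seasonal)

print(seasonal_index.round(3))
print(residual.dropna().describe().round(4))
```

Since the synthetic series contains no noise, the residuals stay close to 1, which is the neutral value in a multiplicative model.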

Autocorrelation and Partial Autocorrelation Plots

1. Autocorrelation: Autocorrelation measures the association between a time series and its lagged values. An autocorrelation plot shows the correlation at every lag, and peaks indicate high correlation at particular lags. By revealing recurring patterns in the time series data, it aids in understanding its temporal structure and supports the choice of suitable model parameters for time series analysis.

2. Partial Autocorrelation: Partial autocorrelation measures a variable's direct correlation with one of its lags after removing the effect of the intermediate lags. Significant peaks in a Partial Autocorrelation Function (PACF) plot indicate that a lag has a direct impact on the current observation. Because it captures the distinct contribution of each lag, it helps in choosing the order of the autoregressive component in time series modeling.

Python
# Autocorrelation and Partial Autocorrelation
plot_acf(ts, lags=30)
plot_pacf(ts, lags=30)
plt.show()

Output:

Autocorrelation (ACF) Plot

The ACF measures the correlation between a time series and its lagged values at different time intervals. In the ACF plot:

  • The x-axis represents the number of lags or time intervals.
  • The y-axis represents the correlation coefficient.

Interpretation:

  • Bars that extend outside the blue shaded region are considered statistically significant.
  • A positive value at a lag indicates a positive correlation between the current observation and the observation at that lag.
  • A negative value at a lag indicates a negative correlation.

PACF Plot

The PACF measures the correlation between a time series and its lagged values, controlling for the effects of the shorter, intermediate lags. In the PACF plot:

  • The x-axis represents the number of lags or time intervals.
  • The y-axis represents the partial correlation coefficient.

Interpretation:

  • Bars that extend outside the blue shaded region are considered statistically significant.
  • The partial autocorrelation at a specific lag represents the correlation between the current observation and past observations at that lag, excluding the influence of intermediate lags.
  • It helps identify the direct relationship between the current observation and a specific lag.

For a time series "ts" which represents precipitation data, the provided code creates plots of the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). The correlation between the time series and its lag values is displayed in these charts which are restricted to 30 lags.

The ACF plot's peaks suggest strong correlations at particular lags that may be seasonal. The PACF plot shows the direct impact of each lag on the current observation, which makes it easier to find appropriate autoregressive terms for time series modeling. These plots serve as a general reference for selecting model parameters and identifying temporal patterns.
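
To show what a single bar of the ACF plot measures, the sketch below computes the correlation between a series and a shifted copy of itself. The synthetic 12-step cycle and the helper name autocorr_at are assumptions for this example. The helper uses the Pearson correlation of the overlapping pairs, which can differ slightly from the estimator used by plot_acf.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(240)
# Hypothetical series with a 12-step cycle plus noise.
series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.2, size=240))


def autocorr_at(s, lag):
    """Pearson correlation between s(t) and s(t - lag)."""
    return s.corr(s.shift(lag))


r6 = autocorr_at(series, 6)     # half a cycle apart: values move in opposite directions
r12 = autocorr_at(series, 12)   # a full cycle apart: values move together
print(f"lag 6: {r6:.2f}, lag 12: {r12:.2f}")

# pandas offers the same calculation as Series.autocorr.
print(f"Series.autocorr(lag=12): {series.autocorr(lag=12):.2f}")
```

A strong positive value at lag 12 and a strong negative value at lag 6 is the kind of pattern that points to yearly seasonality in monthly data.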

Lag Plot

In time series analysis a lag plot is a graphical tool that shows the relationship between a variable and its lagged values. It helps find patterns, trends or unpredictability in the data by comparing each data point to its prior observation. The plot may show the presence of autocorrelation if there is a substantial correlation between a point and its lag. This can help with understanding the temporal dependencies and direct the selection of the best model for time series analysis.

Python
# Lag Plots
for i in range(1, 3):
    lag_plot(ts, lag=i)
    plt.title(f'Lag Plot with lag = {i}')
    plt.show()

Output:

Lag Plot with lag = 1

The x-axis represents the values of the time series at time t and the y-axis represents the values of the time series at time t+1 (lag 1).

Interpretation:

  • If the points in the plot follow a well-defined pattern or trend, it suggests autocorrelation at lag 1.
  • If the points are randomly scattered, it indicates a lack of autocorrelation at lag 1.

Lag Plot with lag = 2

Interpretation:

  • Examines the relationship between values at time t and values at time t+2.
  • Helps identify autocorrelation at lag 2.

With lags set to 1 and 2, the code creates lag plots for the time series "ts". Each iteration plots every data point against the observation that lies one or two time steps away from it. These representations make it easier to spot possible trends and relationships in the data.

Stationarity Check

Stationarity check is an essential step in time series analysis. Model predictions are made easier when a time series is stationary since its mean, variance and autocorrelation remain constant. Visual inspections like rolling statistics graphs and formal statistical tests like the Augmented Dickey-Fuller test are common approaches. By reducing the effect of non-constant patterns, stationarity assures dependable modeling and facilitates more precise forecasting and trend analysis of time series data.

Python
# Stationarity check
adf_result = adfuller(ts)
print('ADF Statistic:', adf_result[0])
print('p-value:', adf_result[1])

Output:

ADF Statistic: -2.4663249017732705
p-value: 0.12388427626757825

The code uses the Augmented Dickey-Fuller (ADF) test to test the time series data for stationarity. The ADF Statistic and the p-value are displayed.

The null hypothesis of the ADF test is that the time series has a unit root, i.e. that it is non-stationary. A low p-value (commonly below 0.05) together with a more negative ADF Statistic is evidence against the unit root and therefore in favour of stationarity. Here the p-value is about 0.12, so the null hypothesis cannot be rejected at the 5% level. The series is treated as non-stationary, which suggests that differencing is needed.

Rolling and Aggregations

1. Rolling: Rolling is a statistical method for time series analysis that computes summary statistics such as moving averages over successive subsets of a dataset. A fixed-size window traverses the data and a new value is computed based on the observations within that window at each step.

2. Aggregation: Aggregations are a common technique in time series analysis to identify broad trends. They merge and summarize several data points into a single value. Aggregations help the process of interpreting complicated time series data by combining related information.

Python
# Rolling and Aggregations
rolling_mean = ts.rolling(window=12).mean()
rolling_std = ts.rolling(window=12).std()

plt.plot(ts, label='Actual Data')
plt.plot(rolling_mean, label='Rolling Mean')
plt.plot(rolling_std, label='Rolling Std')
plt.legend()
plt.show()

Output:

Rolling Mean and Rolling Standard Deviation

The blue line represents the original time series data.

Rolling Mean:

  • The orange line represents the rolling mean of the time series.
  • The rolling mean is calculated over a window of 12 data points (months in this case).

Interpretation:

  • If the rolling mean smooths out the fluctuations in the actual data, it helps in identifying trends.
  • Rising or falling trends can be observed by comparing the rolling mean to the actual data.

Rolling Standard Deviation:

  • The green line represents the rolling standard deviation of the time series.
  • The rolling standard deviation is calculated over a window of 12 data points.

Interpretation:

  • Indicates the volatility or variability of the time series.
  • Peaks in the rolling standard deviation can signify periods of increased variability.

The code computes the rolling mean and rolling standard deviation with a window size of 12, performing rolling statistics on the time series 'ts'. The rolling mean, rolling standard deviation and actual data are superimposed on the generated graphs to help visualize patterns and variability. For easier understanding, the legend separates the rolling mean, rolling standard deviation and the original data.
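
The difference between rolling and aggregation can be seen on a tiny example. The six values and dates below are hypothetical, and grouping by calendar year is just one possible aggregation.

```python
import pandas as pd

values = pd.Series(
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    index=pd.date_range("2020-10-01", periods=6, freq="MS"),
)

# Rolling: a window of 3 observations slides along the series.
# The first two results are NaN because the window is not yet full;
# afterwards the windows are (2, 4, 6), (4, 6, 8), (6, 8, 10), (8, 10, 12).
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean)

# Aggregation: all observations of a calendar year collapse into one total.
# 2020 covers Oct-Dec (2 + 4 + 6) and 2021 covers Jan-Mar (8 + 10 + 12).
yearly_total = values.groupby(values.index.year).sum()
print(yearly_total)
```

Rolling keeps one output per input row, whereas aggregation reduces the series to one value per group.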

Model Development

We will train a SARIMA model for the univariate forecast, using the date as the index of the series and precipitation as the only variable. For that, we first need a date column in the pandas datetime format, so we will use the pd.to_datetime function that is available in pandas.

Python
# Combine the 'Year', 'Month' and 'Day' columns to create a 'Date' column
df['Date'] = pd.to_datetime(df['Year'].astype(str) +
                            '-' + df['Month'].astype(str) +
                            '-' + df['Day'].astype(str),
                            format='%Y-%m-%d')
df.head()

Output:

Date column addition

Now let's set the date column as the index. The target is the precipitation column, so let's separate it from the complete dataset.

Python
ts = df.set_index('Date')['Precipitation']
ts

Output:

Date - Precipitation data

Training the SARIMA Model

Now let's train a SARIMA model on the dataset at hand.

Python
# Fit a SARIMA model
# We need to specify the order (p, d, q) and seasonal order (P, D, Q, S)
# Example order and seasonal order values
p, d, q = 1, 1, 1

# For monthly data with a yearly seasonality
P, D, Q, S = 1, 1, 1, 12  

model = sm.tsa.SARIMAX(ts, order=(p, d, q), seasonal_order=(P, D, Q, S))
results = model.fit()

As the model has been trained we can use this model to predict the rain for the next year and plot it along with the original data to get a feel for whether the predictions are following the previous trend or not.

This code fits the time series ts to a SARIMA model. The model's order is defined by the order=(p, d, q) argument where p denotes the number of autoregressive terms, d denotes the degree of differencing and q denotes the number of moving average terms.

  • The order of the seasonal component of the model is specified by the seasonal_order=(P, D, Q, S) argument where P denotes the number of seasonal autoregressive terms, D is the degree of seasonal differencing, Q is the number of seasonal moving average terms and S is the length of the seasonal period.
  • A SARIMA model object is created by the code line model = sm.tsa.SARIMAX(ts, order=(p, d, q), seasonal_order=(P, D, Q, S)).
  • The first argument, ts, is the time series that needs to be modeled. Calling model.fit() then estimates the model parameters and returns the fitted results.
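
The differencing terms d and D can be illustrated with pandas alone. The sketch below uses a hypothetical noise-free series made of a linear trend plus a pattern that repeats every 12 steps. It shows the transformation that d=1, D=1 and S=12 imply, not how SARIMAX applies it internally.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
pattern = np.tile([5, 3, 1, 0, 0, 2, 8, 9, 6, 4, 3, 4], 4)   # repeats every 12 steps
series = pd.Series(0.5 * t + pattern)                          # linear trend + seasonality

# D = 1 with S = 12: subtract the value from one season earlier, y(t) - y(t-12).
# The repeating pattern cancels and only the trend's yearly increase remains.
seasonal_diff = series.diff(12)

# d = 1: subtract the previous value, removing the remaining constant level.
both = seasonal_diff.diff(1)

print(seasonal_diff.dropna().unique())   # constant 12 * 0.5 = 6.0
print(both.dropna().unique())            # all zeros
```

On real data the differenced series is not exactly constant, but the same operations remove most of the seasonal pattern and trend before the AR and MA terms are fitted.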

Predictions

Python
# Make predictions
forecast = results.get_forecast(steps=12)  # Forecast the next 12 periods
forecast_mean = forecast.predicted_mean

# Plot the actual data and the forecast
plt.figure(figsize=(12, 6))
plt.plot(ts, label='Actual Data')
plt.plot(forecast_mean, label='SARIMA Forecast')
plt.legend()
plt.show()

Output:

Actual Data and SARIMA Forecast

This code plots the actual data and the forecast. It makes predictions using the fitted SARIMA model.

