Lecture9_Time_Series_2024_and_data_analysis (1).pdf

2031ICT Data Analytics Methods
Lecture 9: Time Series and Sequence Data Analysis

Outline
§ Time-series analysis
§ Definition
§ Types of analysis
§ Sequence data analysis
§ Sequence pattern mining
§ Natural language processing

Stream data
§ There are two types of stream data:
• Time series
• Sequence data
• A time series is a series of data points indexed in time order.
• Usually, a time series is a sequence taken at successive spaced points in time.
• Sequence data is a series of ordered elements or events recorded with or without a
concrete notion of time.
• Therefore, a time series can be considered a special case of sequence data.

Time series
§ A time series is often plotted as a run chart, a graph that displays observed
data in time sequence.
§ Data is recorded at regular intervals.
§ Time series examples include weather data, heart rate monitoring (EKG), brain
monitoring (EEG), quarterly sales, stock prices, industry forecasts and interest rates.
More broadly, time series arise in any domain of applied science and engineering
that involves temporal measurements.

Method of time series analysis
§ The methods can be divided into frequency-domain methods and time-domain
methods.
• Frequency: the number of occurrences of a repeating event per unit of time
https://www.youtube.com/watch?v=fYtVHhk3xJ0

Method of time series analysis
§ The methods can also be divided into parametric and non-parametric methods.
§ Parametric approaches assume that the underlying process, which evolves over
time, has a certain structure that can be described using a small number of
parameters (e.g., mean and standard deviation). The task is to estimate the
parameters of the model.
§ Non-parametric approaches explicitly estimate the properties of the process
without assuming that the process has any particular structure.
§ Methods of time series analysis may also be divided into linear and
non-linear (regression) methods.

Motivation of time series
§ In the context of statistics, econometrics, quantitative finance, seismology,
meteorology, and geophysics the primary goal of time series analysis is
forecasting. The goal is to predict what will happen.
§ In the context of signal processing, control engineering and communication
engineering, it is used for signal detection. The goal is to detect anomalies and
events, and to identify their causes.
§ Other applications are in data mining, pattern recognition and machine learning,
where time series analysis can be used for clustering, classification, query by
content, anomaly detection as well as forecasting.

Types of time series analysis
§ Descriptive analysis: Identifies patterns in time series data, like trends and cycles;
Highlights the main characteristics of the time series data, usually in a visual format
§ Curve fitting: Plots the data along a curve to study the relationships of variables
within the data
§ Prediction and forecasting: Predicts future data based on historical trends
§ Classification: Identifies and assigns categories to the data
§ Segmentation: Splits the data into segments to show the underlying properties of
the source information
§ Anomaly detection: Finds outlier data or events

Descriptive (exploratory) analysis
§ Compute descriptive statistical measures: mean, median, max, min, etc.
• These statistics do not take time into account, so all the points are
considered equivalent and the time series reduces to an ordinary dataset.
§ Trend, cycle, seasonal
• Requires mathematical modelling
Marcel Dettling, Applied Time Series Analysis, 2020
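A minimal sketch of why these statistics ignore time: shuffling the series changes its order but none of the summary values. The readings and the helper name `summarize` are made up for illustration.

```python
import random
import statistics


def summarize(values):
    """Return order-independent descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
    }


# Hypothetical readings with an upward trend (illustrative values only).
series = [10, 12, 13, 15, 18, 21, 25, 30]

# Shuffle a copy: the trend is destroyed, the statistics are not.
shuffled = series[:]
random.Random(0).shuffle(shuffled)

print(summarize(series))
print(summarize(series) == summarize(shuffled))
```

The trend is visible only when the order is kept, which is why trend, cycle and seasonal analysis need a model of time.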

Curve fitting
§ Curve fitting is the process of constructing a curve,
or mathematical function, that has the best fit to a
series of data points, possibly subject to constraints.
§ Curve fitting can involve either:
§ Interpolation, where an exact fit to the data is
required, or smoothing, in which a "smooth"
function is constructed that approximately fits the
data; or
§ Extrapolation, which extends the fitted curve beyond the range
of the observed data and is subject to a degree of
uncertainty, since it may reflect the method used to
construct the curve as much as it reflects the
observed data.
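A small sketch of the difference between interpolation and extrapolation, assuming the simplest possible curve: straight segments between neighbouring observations. The function name `piecewise_linear` and the sample readings are hypothetical.

```python
from bisect import bisect_right


def piecewise_linear(xs, ys, x):
    """Evaluate the piecewise-linear curve through the points (xs, ys) at x.

    Inside [xs[0], xs[-1]] this is interpolation. Outside that range the
    nearest end segment is extended, which is extrapolation.
    xs must be sorted in increasing order and contain at least two points.
    """
    # Index of the segment's left endpoint, clamped so that the first or
    # last segment is reused when x lies outside the observed range.
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])


# Hypothetical temperature readings at hours 0, 1, 2 and 4.
hours = [0, 1, 2, 4]
temps = [10.0, 12.0, 11.0, 15.0]

print(piecewise_linear(hours, temps, 3))  # interpolation: between observations
print(piecewise_linear(hours, temps, 6))  # extrapolation: beyond the last observation
```

The extrapolated value depends entirely on the choice to extend the last segment; a different curve family would give a different answer from the same data.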

Curve fitting
§ The construction of an economic time series involves the estimation of some
components for some dates by interpolation between values for earlier and later
dates.
§ Interpolation is useful where the data surrounding the missing data is available and
its trend, seasonality, and longer-term cycles are known.
§ Alternatively, polynomial interpolation or spline interpolation is used. In spline
interpolation, piecewise polynomial functions are fitted to the time intervals so that
they join smoothly.
§ Difference between polynomial regression and spline (piecewise polynomial)
interpolation
§ Polynomial regression gives a single polynomial that models the entire data set
approximately; it need not pass through every data point.
§ Spline interpolation yields a piecewise continuous function composed of
many polynomials that passes through every data point.
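A sketch of the regression side of this comparison, using the simplest case: a degree-1 polynomial (a straight line) fitted by ordinary least squares. The data and the helper name `fit_line` are illustrative assumptions. The non-zero residuals show that a regression curve approximates the data, whereas an interpolant would have zero residual at every observed point.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations (illustrative values only).
xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 2.0, 5.0]

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value on the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(slope, intercept)
print(residuals)
```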

Curve fitting
https://stackoverflow.com/questions/30433391/how-can-i-produce-multi-point-linear-interpolation

Prediction and forecasting
§ Time series forecasting is one of the most widely applied data science techniques in
business, finance, supply chain management, production and inventory planning.
Many prediction problems involve a time component and thus require extrapolation
of time series data, or time series forecasting.
§ Time series forecasting is also an important area of machine learning (ML) and can
be cast as a supervised learning problem. ML methods such as regression, neural
networks and support vector machines can be applied to it. Forecasting involves
taking models fit on historical data and using them to predict future observations.
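One common way to cast forecasting as supervised learning is a sliding window: the previous few values become the input features and the next value becomes the target. The sketch below assumes this framing; the function name `make_supervised` and the sales figures are hypothetical.

```python
def make_supervised(series, n_lags):
    """Turn a series into (X, y) training pairs.

    Each row of X holds n_lags consecutive past values and the matching
    entry of y is the value that came immediately after them.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y


# Hypothetical quarterly sales (illustrative values only).
sales = [5, 7, 9, 11, 13]
X, y = make_supervised(sales, n_lags=2)

for features, target in zip(X, y):
    print(features, "->", target)
```

Any regression model can then be trained on `X` and `y`. The rows must stay in time order when splitting into training and test sets, so that the model is never evaluated on data older than what it was trained on.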

Prediction and forecasting
§ Time series forecasting means predicting future values over a
period of time. It entails developing models based on previous data and applying
them to make predictions and guide future strategic decisions.
§ The future is forecast or estimated based on what has already happened. A time
series adds a time-order dependence between observations. This dependence is
both a constraint and a structure that provides a source of additional information.

Goals of forecasting
§ When forecasting, it is important to understand your goal. To narrow down the
specifics of your predictive modelling problem, ask questions about:
• Volume of data available: more data is often more helpful, offering greater
opportunity for exploratory data analysis, model testing and tuning.
• Required time horizon of predictions: shorter time horizons are often easier to
predict, with higher confidence, than longer ones.
• Forecast update frequency: forecasts might need to be updated frequently
over time (updating forecasts as new information becomes available often
results in more accurate predictions).
• Forecast temporal frequency: forecasts can often be made at lower or higher
frequencies, which allows down-sampling and up-sampling of the data.
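A minimal sketch of changing the temporal frequency, assuming two simple strategies: down-sampling by averaging consecutive blocks and up-sampling by repeating each value. Both helper names and the data are hypothetical, and other strategies (sums, interpolation) are equally valid.

```python
def downsample_mean(values, factor):
    """Lower the frequency: replace each block of `factor` values by its mean."""
    blocks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [sum(block) / len(block) for block in blocks]


def upsample_repeat(values, factor):
    """Raise the frequency: repeat each value `factor` times (forward fill)."""
    return [value for value in values for _ in range(factor)]


# Hypothetical daily readings (illustrative values only).
daily = [2, 4, 6, 8, 10, 12]

three_day = downsample_mean(daily, 3)
print(three_day)
print(upsample_repeat(three_day, 3))
```

Down-sampling loses detail, and up-sampling cannot recover it: the up-sampled series is flat within each block.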

Methods of forecasting - Decompositional Method
• Time series data can exhibit a
variety of patterns, so it is often
helpful to split a time series into
components, each representing an
underlying pattern category. This is
what decompositional models do.
• We often know how each
component affects the progression
of time series data, e.g., the Boxing Day
sales surge, which makes
decomposition less difficult.
https://quantdare.com/decomposition-to-improve-time-series-prediction/
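A rough sketch of one decompositional model, assuming an additive form (series = trend + seasonal + residual) and an odd seasonal period. The trend is a centred moving average and the seasonal component is the average detrended value at each position in the cycle. The function name `decompose_additive` and the data are made up; this is one simple variant, not the method used in the linked article.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual parts (additive model).

    Assumes `period` is odd and the series covers at least two full cycles.
    Trend and residual are None at the edges where the window does not fit.
    """
    n = len(series)
    half = period // 2

    # Trend: centred moving average over one full seasonal cycle.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period

    # Seasonal: mean detrended value for each position within the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(bucket) / len(bucket) for bucket in buckets]
    offset = sum(means) / period  # centre the seasonal effects on zero
    seasonal = [means[t % period] - offset for t in range(n)]

    # Residual: whatever trend and seasonality do not explain.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Hypothetical series: linear trend t plus a repeating pattern [1, 0, -1].
pattern = [1, 0, -1]
series = [t + pattern[t % 3] for t in range(9)]

trend, seasonal, residual = decompose_additive(series, period=3)
print(trend)
print(seasonal)
print(residual)
```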

Methods of forecasting - Smooth-based
§ In time series forecasting, data smoothing is a statistical technique that
reduces noise and the influence of outliers in a time series data set to make a
pattern more visible. Smoothing removes or reduces random variation and shows
underlying trends and cyclic components.
https://statisticsbyjim.com/time-series/exponential-smoothing-time-series-forecasting/
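A sketch of simple exponential smoothing, one smoothing technique among several. Each smoothed value is a weighted average of the new observation and the previous smoothed value. The smoothing factor `alpha` (between 0 and 1), the choice to start from the first observation, and the data are assumptions for illustration.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1],
    starting from smoothed[0] = series[0].
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy series that alternates between 10 and 20.
noisy = [10, 20, 10, 20]
print(exponential_smoothing(noisy, alpha=0.5))
```

A smaller `alpha` gives a smoother curve that reacts more slowly to new data. With `alpha = 1` the output is the original series.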

Smooth-based forecasting - Moving average
§ A moving average computes each output value as the mean, a linear combination,
of the current and several past values of the time series.
§ Example:
https://statisticsbyjim.com/time-series/moving-averages-smoothing/
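A minimal sketch of a trailing moving average, assuming equal weights and a window that ends at the current point. The function name, the window size and the data are illustrative.

```python
def moving_average(series, window):
    """Trailing moving average.

    Output i is the mean of the `window` values ending at position
    i + window - 1, so the result is window - 1 items shorter than the input.
    """
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical readings (illustrative values only).
readings = [3, 5, 7, 9, 11]
smoothed = moving_average(readings, window=3)
print(smoothed)

# A naive one-step forecast: reuse the most recent average.
print(smoothed[-1])
```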

Classification
§ Assigning a time series pattern to a
specific category.
§ It can be used for handwriting
recognition, voice recognition,
speaker recognition, ECG/EEG
signal classification and so on.

Classification – dynamic time warping
§ Dynamic time warping (DTW) combined with K-nearest neighbors (K-NN)
has been a benchmark for other time series classification algorithms to
beat.
§ Idea: align two series by stretching or compressing them along the time axis, so
that similar shapes match even when they span longer or shorter time slots.
§ One classifies a new incoming time series by finding the K most similar time
series in the training data and assigning the new time series to the class
that appears most often among them.

Classification – dynamic time warping
§ DTW needs to calculate the distance between two time series to
tell whether they are close or not. While we can take the distance between each
pair of points in the time series, it is not necessarily clear which points should be
compared to which in the two time series.
§ DTW solves this by pairing up the time points, drawing lines
between them in such a way that each time point in a series must be
connected to a time point in the other series, and two lines must never
cross. The distance is the sum of the absolute differences between the paired
values, minimised over all such pairings.
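The pairing rule can be computed with dynamic programming. The sketch below is a textbook-style DTW with absolute difference as the point cost and no warping-window constraint, followed by a K-NN vote that uses it as the distance. The function names, labels and series are hypothetical.

```python
from collections import Counter


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] is the cheapest way to pair the first i points of a with the
    first j points of b such that every point is paired and no lines cross.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] shares b[j-1] with the previous point of a
                cost[i][j - 1],      # b[j-1] shares a[i-1] with the previous point of b
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


def knn_classify(query, training, k=1):
    """Label `query` by majority vote among its k DTW-nearest training series.

    `training` is a list of (series, label) pairs.
    """
    nearest = sorted(training, key=lambda item: dtw_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set with two shapes (illustrative values only).
training = [
    ([1, 2, 3, 2, 1], "peak"),
    ([3, 2, 1, 2, 3], "valley"),
]

# Same peak shape as the first training series, but stretched in time.
query = [1, 1, 2, 3, 3, 2, 1]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(knn_classify(query, training, k=1))
```

A point-by-point distance would need both series to have the same length. DTW treats the stretched peak as identical to the original.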

Segmentation
§ Splitting a time-series into a sequence of segments. It is often the case that a time-
series can be represented as a sequence of individual segments, each with its own
characteristic properties. For example, the audio signal from an audio conference
call can be partitioned into pieces corresponding to the times during which each
person was speaking. In time-series segmentation, the goal is to identify the
segment boundary points in the time-series, and to characterize the dynamical
properties associated with each segment.
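A deliberately simple sketch of finding one segment boundary. It assumes that segments differ in their mean level and picks the split that minimises the total within-segment squared error. This is an illustrative stand-in, not a technique named in the slides; real segmentation methods handle many boundaries and richer segment models. The names `sse` and `best_split` and the signal are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of `values` from their own mean."""
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def best_split(series):
    """Return the index k that best splits the series into series[:k] and series[k:].

    "Best" means the smallest total within-segment squared error.
    """
    return min(
        range(1, len(series)),
        key=lambda k: sse(series[:k]) + sse(series[k:]),
    )


# Hypothetical signal: a quiet speaker followed by a loud one.
signal = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]

boundary = best_split(signal)
print(boundary)
print(signal[:boundary], signal[boundary:])
```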

Anomaly detection
§ An outlier is a data point that deviates so much from some standard or usual
pattern as to arouse suspicion that it was generated by a different mechanism.
§ Are any abnormal signals or events observed?
https://neptune.ai/blog/anomaly-detection-in-time-series
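One basic way to flag outliers is a z-score rule: mark any point that lies more than a chosen number of standard deviations from the mean. The sketch below assumes this rule with a threshold of 2; the function name, the threshold and the data are illustrative, and the rule ignores trend and seasonality.

```python
import statistics


def zscore_anomalies(series, threshold=2.0):
    """Return the indices of points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(series)
    std = statistics.pstdev(series)  # population standard deviation
    if std == 0:
        return []  # a constant series has no outliers under this rule
    return [
        index
        for index, value in enumerate(series)
        if abs(value - mean) / std > threshold
    ]


# Hypothetical sensor readings with one spike (illustrative values only).
readings = [10, 11, 9, 10, 12, 10, 50, 11, 9, 10]
print(zscore_anomalies(readings))
```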

Sequential pattern mining
§ Sequential pattern mining aims to find statistically relevant patterns between data
examples in a data sequence.
§ One of the most important tasks in sequential pattern mining is string mining, which
aims to find a string in a sequence. Examples include finding
words, phrases or sentences in a long text, or finding a particular pattern of nucleotide
bases 'A', 'G', 'C' and 'T' in DNA sequences, or of amino acids in protein sequences.
This kind of technology has been used in sequencing COVID-19 viruses.
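A minimal sketch of string mining on a DNA-like sequence: list every position where a pattern occurs, including overlapping occurrences. The function name and the sequence are made up for illustration.

```python
def find_occurrences(sequence, pattern):
    """Return the start index of every (possibly overlapping) occurrence of pattern."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so that overlapping matches are found too.
        start = sequence.find(pattern, start + 1)
    return positions


# Hypothetical nucleotide sequence (illustrative only).
dna = "AGCTAGCAGCT"
print(find_occurrences(dna, "AGC"))
print(find_occurrences("AAAA", "AA"))
```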

Sequential pattern mining
§ A problem in sequence mining is to find frequent itemsets and the order in which
they appear.
§ For example, we may want to find whether "if a {customer buys a book},
he or she is likely to {buy another book} next month", or in the context of stock
prices, "if {price of Apple rises}, it is likely that {price of Samsung rises} in the same
week".
§ Itemset mining is useful for finding relationships between frequently co-occurring
items in large transactions. The information can then be used to develop a
recommendation system.

Sequential pattern mining – definition
§ A sequence is an ordered list of elements: s = <e1, e2, e3, e4, e5, e6>.
§ Element e1 happens before e2 which is before e3, and so on.
§ Take the steps to put an elephant into a fridge as an example:
§ Open the door of the fridge
§ Put the elephant into the fridge
§ Close the door
So s = < open the door, put the elephant, close the door>

Sequence dataset
§ A sequence dataset for online transactions may look like this:
Customer ID   Transaction ID   Purchased
C1            1                a,b,c,d
C2            2                a,f,c,e
C1            3                d,f
C3            4                b,c,e,f
C2            5                a,c,d,e
C1            6                a,e,d
§ Then the transaction sequence of customer 1 is:
S1 = < {a,b,c,d}, {d,f}, {a,e,d} >. We can then check whether another customer has a
similar sequence or sub-sequence of S1, and use the information to recommend items
that this new customer may be interested in.
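A sketch of how the per-customer sequences can be built from the transaction table. It assumes that transaction IDs increase with time, which the table does not state explicitly; the function name `build_sequences` is hypothetical.

```python
from collections import defaultdict

# The transaction table: (customer ID, transaction ID, purchased items).
transactions = [
    ("C1", 1, {"a", "b", "c", "d"}),
    ("C2", 2, {"a", "f", "c", "e"}),
    ("C1", 3, {"d", "f"}),
    ("C3", 4, {"b", "c", "e", "f"}),
    ("C2", 5, {"a", "c", "d", "e"}),
    ("C1", 6, {"a", "e", "d"}),
]


def build_sequences(rows):
    """Group transactions by customer, ordered by transaction ID.

    Assumption: a larger transaction ID means a later purchase.
    """
    sequences = defaultdict(list)
    for customer, _transaction_id, items in sorted(rows, key=lambda row: row[1]):
        sequences[customer].append(set(items))
    return dict(sequences)


sequences = build_sequences(transactions)
for customer, sequence in sequences.items():
    print(customer, [sorted(itemset) for itemset in sequence])
```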

Subsequence
§ A sequence <a_1, a_2, …, a_n> is contained in another sequence
<b_1, b_2, …, b_m> (m ≥ n) if there exist integers i_1 < i_2 < … < i_n such that
a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}
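The containment definition can be checked with a greedy scan: match each element of the shorter sequence to the earliest remaining element of the longer one that is a superset of it. The sketch below assumes sequences are lists of Python sets; the function name `contains` is hypothetical.

```python
def contains(data_sequence, pattern):
    """Return True if `pattern` is contained in `data_sequence`.

    Both arguments are lists of itemsets. Each pattern element must be a
    subset of a distinct element of the data sequence, in the same order.
    """
    position = 0
    for element in pattern:
        # Skip elements of the data sequence that do not cover this itemset.
        while position < len(data_sequence) and not set(element) <= set(data_sequence[position]):
            position += 1
        if position == len(data_sequence):
            return False
        position += 1  # the next pattern element must match strictly later
    return True


# Example data sequence of three itemsets.
s1 = [{"a", "b", "c", "d"}, {"d", "f"}, {"a", "e", "d"}]

print(contains(s1, [{"a"}, {"d"}]))            # a in the 1st element, d in the 2nd
print(contains(s1, [{"a", "d"}, {"a", "d"}]))  # matched by the 1st and 3rd elements
print(contains(s1, [{"f"}, {"b"}]))            # no b after the element containing f
```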

Support of a subsequence
§ The support of a subsequence w is defined as the fraction of data sequences
that contain w
§ A sequential pattern is a frequent subsequence (i.e., a subsequence whose
support is ≥ minsup where minsup means minimum support)
§ Given:
• a database of sequences
• a user-specified minimum support threshold, minsup
§ Task:
• Find all subsequences with support ≥ minsup

apriori property for sequences
§ Let D be a database that contains a collection of data sequences d. The support of
a sequence t is the fraction of all data sequences that contain t:
$$s(t) = \frac{|\{d \in D : t \text{ is a subsequence of } d\}|}{|D|}$$
where |D| is the number of sequences in the database
§ apriori property:
§ If a data sequence d contains a sequence t, then it will also contain any
subsequence of t.
§ Therefore: If w is a subsequence of t, then s(w) ≥ s(t).
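A sketch that computes the support s(t) over a small database and checks the apriori property on one example. The database reuses the three customer sequences from the transaction table; the helper names `contains` and `support`, the chosen patterns and the minsup value are assumptions for illustration.

```python
def contains(data_sequence, pattern):
    """True if each itemset of `pattern` is a subset of a later and later element of `data_sequence`."""
    position = 0
    for element in pattern:
        while position < len(data_sequence) and not element <= data_sequence[position]:
            position += 1
        if position == len(data_sequence):
            return False
        position += 1
    return True


def support(database, pattern):
    """Fraction of data sequences in `database` that contain `pattern`."""
    return sum(contains(d, pattern) for d in database) / len(database)


# One data sequence per customer (C1, C2, C3).
database = [
    [{"a", "b", "c", "d"}, {"d", "f"}, {"a", "e", "d"}],
    [{"a", "f", "c", "e"}, {"a", "c", "d", "e"}],
    [{"b", "c", "e", "f"}],
]

t = [{"c"}, {"e"}]  # buy c, then buy e in a later transaction
w = [{"c"}]         # w is a subsequence of t

print(support(database, t))
print(support(database, w))

# apriori property: a subsequence is at least as frequent as the sequence.
print(support(database, w) >= support(database, t))

# Keep only patterns that reach a hypothetical minimum support of 0.6.
minsup = 0.6
frequent = [p for p in (t, w, [{"b"}, {"f"}]) if support(database, p) >= minsup]
print(frequent)
```

The property is what lets a miner prune: once a short pattern falls below minsup, no longer pattern that contains it needs to be counted.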

Frequent subsequences
If A, B, C, D and E are customers and the events are products, then {2,4}, {3},{5} and
{1},{2} are the more frequent combinations, and so candidates for recommendation,
in the following transactions.
