Enhancing the Locality and Breaking the Memory
Bottleneck of Transformer on Time Series Forecasting
Shiyang Li
Xiaoyong Jin
Yao Xuan
Xiyou Zhou
Wenhu Chen
Yu-Xiang Wang
Xifeng Yan
University of California, Santa Barbara
Abstract
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situations. In this paper, we propose to tackle such forecasting problems with Transformer [1]. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length $L$, making directly modeling long time series infeasible. To solve these two issues, we first propose convolutional self-attention, which produces queries and keys with causal convolution so that local context can be better incorporated into the attention mechanism. Then, we propose the LogSparse Transformer with only $O(L(\log L)^2)$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
1 Introduction
Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can guide inventory planning to maximize profit. Although still widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2] and Autoregressive (AR) models, are designed to fit each time series independently. They also require practitioners' expertise in manually selecting trend, seasonality and other components. These two major weaknesses have greatly hindered their application to modern large-scale time series forecasting tasks.

To tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as an alternative, where Recurrent Neural Networks (RNNs) [7, 8, 9] are employed to model time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10] because of the vanishing and exploding gradient problems. Despite the emergence of variants such as LSTM [11] and GRU [12], the issues remain unresolved. As an example, [13] shows that language models using LSTM have an effective context size of about 200 tokens on average but are only able to sharply distinguish the nearest 50 tokens, indicating that even LSTM struggles to capture long-term dependencies. On the other hand, real-world forecasting applications often have both long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway in traffic data has both daily and hourly patterns. In such cases, how to model long-term dependencies becomes the critical step in achieving promising performance.
Recently, Transformer [1, 14] has been proposed as a brand-new architecture that leverages the attention mechanism to process a sequence of data. Unlike RNN-based methods, Transformer allows the model to access any part of the history regardless of distance, making it potentially better suited to grasping recurring patterns with long-term dependencies. However, canonical dot-product self-attention matches queries against keys without regard to local context, which may make the model prone to anomalies and bring underlying optimization issues. More importantly, the space complexity of the canonical Transformer grows quadratically with the input length $L$, which causes a memory bottleneck when directly modeling long time series with fine granularity. We specifically delve into these two issues and investigate the application of Transformer to time series forecasting. Our contributions are threefold:
• We successfully apply the Transformer architecture to time series forecasting and perform extensive experiments on both synthetic and real datasets to validate Transformer's potential value in better handling long-term dependencies than RNN-based models.
• We propose convolutional self-attention by employing causal convolutions to produce queries and keys in the self-attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training loss and further improve its forecasting accuracy.
• We propose the LogSparse Transformer with only $O(L(\log L)^2)$ space complexity to break the memory bottleneck, not only making fine-grained long time series modeling feasible but also producing comparable or even better results with much less memory usage, compared to the canonical Transformer (a sketch of the sparsity pattern follows this list).
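To make the memory claim concrete, below is a minimal sketch (not the authors' code) of a log-sparse attention pattern consistent with the stated budget: each position attends only to itself and to positions at exponentially growing distances into the past, so one layer stores $O(L \log L)$ query-key scores, and stacking $O(\log L)$ such layers, enough for information to propagate between any two positions, yields $O(L(\log L)^2)$ overall. The exact pattern used in the paper may differ in its details.

```python
def logsparse_indices(t):
    """Candidate attention set for position t (0-indexed): itself plus
    positions at distances 1, 2, 4, ... back in time -- O(log t) cells
    instead of the t + 1 cells of full causal attention."""
    cells = {t}
    k = 0
    while t - 2 ** k >= 0:
        cells.add(t - 2 ** k)
        k += 1
    return sorted(cells)

# Each row of the attention mask now has O(log L) allowed entries.
L = 16
for t in range(L):
    print(t, logsparse_indices(t))
```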
2 Related Work
Due to the wide applications of forecasting, various methods have been proposed to solve the problem. One of the most prominent models is ARIMA [15]. Its statistical properties, as well as the well-known Box-Jenkins methodology [16] for model selection, make it a common first choice for practitioners. However, its linear assumption and limited scalability make it unsuitable for large-scale forecasting tasks. Furthermore, information across similar time series cannot be shared since each time series is fitted individually. In contrast, [17] models related time series data as a matrix and treats forecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn across multiple related count time series from a graphical-model perspective.
Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic distribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-Layer Perceptrons (MLPs) as a decoder to address the so-called error-accumulation issue and conduct multi-step-ahead forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear SSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear segments. In contrast, [9] handles noise using a local Gaussian process for each time series while using a global RNN to model the shared patterns. [20] tries to combine the advantages of AR models and SSMs, and maintains a complex latent process to conduct multi-step forecasting in parallel.
The well-known self-attention based Transformer [1] has recently been proposed for sequence modeling and has achieved great success. Several recent works apply it to translation, speech, music and image generation [1, 21, 22, 23]. However, scaling attention to extremely long sequences is computationally prohibitive, since the space complexity of self-attention grows quadratically with sequence length [21]. This becomes a serious issue in forecasting time series with fine granularity and strong long-term dependencies.

3 Background
Problem definition. Suppose we have a collection of $N$ related univariate time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} \triangleq [z_{i,1}, z_{i,2}, \cdots, z_{i,t_0}]$ and $z_{i,t} \in \mathbb{R}$ denotes the value of time series $i$ at time $t$¹. We are going to predict the next $\tau$ time steps for all time series, i.e. $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^{N}$. Besides, let $\{x_{i,1:t_0+\tau}\}_{i=1}^{N}$ be a set of associated time-based covariate vectors with dimension $d$ that are assumed to be known over the entire time period, e.g. day-of-the-week and hour-of-the-day. We aim to model the following conditional distribution

$$p(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, x_{i,1:t_0+\tau}; \Phi) = \prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t}; \Phi).$$
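The product form above already dictates how multi-step forecasts are produced: predict one step, sample it, append it to the observed history, and repeat for $\tau$ steps. Below is a minimal sketch of this rollout, assuming a hypothetical `one_step_model` that returns the mean and standard deviation of a Gaussian one-step predictive distribution (an illustrative interface, not the paper's API):

```python
import numpy as np

def forecast(one_step_model, z_hist, x_all, tau, rng=None):
    """Roll a one-step-ahead model forward for tau steps.
    z_hist: observed values z_{1:t0};  x_all: covariates x_{1:t0+tau}."""
    if rng is None:
        rng = np.random.default_rng()
    z = list(z_hist)
    t0 = len(z_hist)
    for t in range(t0, t0 + tau):
        mu, sigma = one_step_model(np.array(z), x_all[: t + 1])
        z.append(rng.normal(mu, sigma))      # sample z_t and feed it back in
    return np.array(z[t0:])                  # forecasts for t0+1, ..., t0+tau

# Toy usage with a stand-in "model" that predicts a noisy random walk.
toy_model = lambda z_past, x_past: (z_past[-1], 0.1)
print(forecast(toy_model, z_hist=np.ones(5), x_all=np.zeros((8, 2)), tau=3))
```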
We reduce the problem to learning a one-step-ahead prediction model $p(z_t \mid z_{1:t-1}, x_{1:t}; \Phi)$², where $\Phi$ denotes the learnable parameters shared by all time series in the collection. To fully utilize both the observations and covariates, we concatenate them to obtain an augmented matrix as follows:

$$y_t \triangleq [z_{t-1} \circ x_t] \in \mathbb{R}^{d+1}, \qquad Y_t = [y_1, \cdots, y_t]^{T} \in \mathbb{R}^{t \times (d+1)},$$

where $[\cdot \circ \cdot]$ represents concatenation. An appropriate model $z_t \sim f(Y_t)$ is then explored to predict the distribution of $z_t$ given $Y_t$.
Transformer. We instantiate $f$ with Transformer³ by taking advantage of the multi-head self-attention mechanism, since self-attention enables Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns. These advantages make Transformer a good candidate for time series forecasting. We briefly introduce its architecture here and refer readers to [1] for more details.
In the self-attention layer, a multi-head self-attention sublayer simultaneously transforms $Y$⁴ into $H$ distinct query matrices $Q_h = Y W_h^{Q}$, key matrices $K_h = Y W_h^{K}$, and value matrices $V_h = Y W_h^{V}$, respectively, with $h = 1, \cdots, H$. Here $W_h^{Q}, W_h^{K} \in \mathbb{R}^{(d+1) \times d_k}$ and $W_h^{V} \in \mathbb{R}^{(d+1) \times d_v}$ are learnable parameters. After these linear projections, the scaled dot-product attention computes a sequence of vector outputs:

$$O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}} \cdot M\right) V_h.$$
Note that a mask matrix $M$ is applied to filter out rightward attention by setting all upper triangular elements to $-\infty$, in order to avoid future information leakage. Afterwards, $O_1, O_2, \cdots, O_H$ are concatenated and linearly projected again. Upon the attention output, a position-wise feedforward sublayer with two layers of fully-connected network and a ReLU activation in the middle is stacked.
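For reference, the masked attention above can be written in a few lines; the following is a plain single-head NumPy sketch in which the mask is imposed by setting disallowed (future) scores to $-\infty$ before the softmax, matching the role of $M$ in the formula (the multi-head version simply repeats this with separate projections and concatenates the outputs):

```python
import numpy as np

def causal_self_attention(Y, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask.
    Y: (t, d+1) augmented inputs;  Wq, Wk: (d+1, d_k);  Wv: (d+1, d_v)."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (t, t)
    t = scores.shape[0]
    scores[np.triu_indices(t, k=1)] = -np.inf          # forbid future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (t, d_v)

# Toy check: the output at step s depends only on rows 0..s of Y.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
print(causal_self_attention(Y, Wq, Wk, Wv).shape)      # (6, 3)
```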
4 Methodology
4.1 Enhancing the locality of Transformer
Patterns in time series may evolve with time significantly due to various events, e.g. holidays and
extreme weather, so whether an observed point is an anomaly, change point or part of the patterns
is highly dependent on its surrounding context. However, in the self-attention layers of canonical
Transformer, the similarities between queries and keys are computed based on their point-wise values
without fully leveraging local context like shape, as shown in Figure 1(a) and (b). Query-key matching
agnostic of local context may confuse the self-attention module in terms of whether the observed
value is an anomaly, change point or part of patterns, and bring underlying optimization issues.
We propose convolutional self-attention to ease the issue. The architectural view of the proposed convolutional self-attention is illustrated in Figure 1(c) and (d). Rather than using convolution of
¹ Here the time index $t$ is relative, i.e. the same $t$ in different time series may represent different actual time points.
² Since the model is applicable to all time series, we omit the subscript $i$ for simplicity and clarity.
³ When referring to Transformer, we only consider the autoregressive Transformer decoder in the following.
⁴ At each time step the same model is applied, so we simplify the formulation with some abuse of notation.
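Based on the description so far (causal convolutions producing queries and keys so that matching is aware of local shape), a minimal PyTorch-style sketch of the query/key generation in convolutional self-attention might look as follows; the kernel size and the zero left-padding scheme are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvQK(nn.Module):
    """Produce queries and keys with causal 1-D convolutions so that each
    query/key summarizes a local window ending at its own time step;
    values keep the usual point-wise (kernel size 1) projection."""
    def __init__(self, d_in, d_k, d_v, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv_q = nn.Conv1d(d_in, d_k, kernel_size)
        self.conv_k = nn.Conv1d(d_in, d_k, kernel_size)
        self.proj_v = nn.Linear(d_in, d_v)

    def forward(self, Y):
        # Y: (batch, t, d_in) -> Conv1d expects (batch, d_in, t).
        x = Y.transpose(1, 2)
        # Left-pad with kernel_size - 1 zeros so no position sees the future.
        x = F.pad(x, (self.kernel_size - 1, 0))
        Q = self.conv_q(x).transpose(1, 2)   # (batch, t, d_k)
        K = self.conv_k(x).transpose(1, 2)   # (batch, t, d_k)
        V = self.proj_v(Y)                   # (batch, t, d_v)
        return Q, K, V

# Q, K, V can then be fed into the masked attention sketched in Section 3.
qk = CausalConvQK(d_in=8, d_k=16, d_v=16, kernel_size=3)
Q, K, V = qk(torch.randn(2, 24, 8))
print(Q.shape, K.shape, V.shape)
```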