Enhancing the Locality and Breaking the Memory
Bottleneck of Transformer on Time Series Forecasting
Shiyang Li
Xiaoyong Jin
Yao Xuan
Xiyou Zhou
Wenhu Chen
Yu-Xiang Wang
Xifeng Yan
University of California, Santa Barbara
Abstract
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situations. In this paper, we propose to tackle such forecasting problems with Transformer [1]. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length $L$, making directly modeling long time series infeasible. To solve these two issues, we first propose convolutional self-attention, which produces queries and keys with causal convolution so that local context can be better incorporated into the attention mechanism. Then, we propose the LogSparse Transformer with only $O(L(\log L)^2)$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
1 Introduction
Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can guide inventory planning to maximize profit. Although still widely used, traditional time series forecasting models, such as State Space Models (SSMs) [2] and Autoregressive (AR) models, are designed to fit each time series independently. They also require practitioners' expertise in manually selecting trend, seasonality and other components. These two major weaknesses have greatly hindered their application to modern large-scale time series forecasting tasks.

To tackle the aforementioned challenges, deep neural networks [3, 4, 5, 6] have been proposed as an alternative, where Recurrent Neural Networks (RNNs) [7, 8, 9] are employed to model time series in an autoregressive fashion. However, RNNs are notoriously difficult to train [10] because of the vanishing and exploding gradient problems. Despite the emergence of variants such as LSTM [11] and GRU [12], the issues remain unresolved. As an example, [13] shows that language models using LSTM have an effective context size of about 200 tokens on average but are only able to sharply distinguish the nearest 50 tokens, indicating that even LSTM struggles to capture long-term dependencies. On the other hand, real-world forecasting applications often have both long- and short-term repeating patterns [7]. For example, the hourly occupancy rate of a freeway in traffic data has both daily and hourly patterns. In such cases, how to model long-term dependencies becomes the critical step in achieving promising performance.
Recently, Transformer [1, 14] has been proposed as a brand-new architecture that leverages the attention mechanism to process a sequence of data. Unlike RNN-based methods, Transformer allows the model to access any part of the history regardless of distance, making it potentially better suited to grasping recurring patterns with long-term dependencies. However, canonical dot-product self-attention matches queries against keys without regard to local context, which may make the model prone to anomalies and bring underlying optimization issues. More importantly, the space complexity of the canonical Transformer grows quadratically with the input length $L$, which causes a memory bottleneck when directly modeling long time series with fine granularity. We specifically delve into these two issues and investigate the application of Transformer to time series forecasting. Our contributions are threefold:
• We successfully apply the Transformer architecture to time series forecasting and perform extensive experiments on both synthetic and real datasets to validate Transformer's potential value in better handling long-term dependencies than RNN-based models.
• We propose convolutional self-attention by employing causal convolutions to produce queries and keys in the self-attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training loss and further improve its forecasting accuracy.
• We propose the LogSparse Transformer with only $O(L(\log L)^2)$ space complexity to break the memory bottleneck, not only making fine-grained long time series modeling feasible but also producing comparable or even better results with much less memory usage, compared to the canonical Transformer (a sketch of the sparsity pattern follows this list).
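To make the memory claim concrete, below is a minimal sketch (not the authors' code) of a log-sparse attention pattern consistent with the stated budget: each position attends only to itself and to positions at exponentially growing distances into the past, so one layer stores $O(L \log L)$ query-key scores, and stacking $O(\log L)$ such layers, enough for information to propagate between any two positions, yields $O(L(\log L)^2)$ overall. The exact pattern used in the paper may differ in its details.

```python
def logsparse_indices(t):
    """Candidate attention set for position t (0-indexed): itself plus
    positions at distances 1, 2, 4, ... back in time -- O(log t) cells
    instead of the t + 1 cells of full causal attention."""
    cells = {t}
    k = 0
    while t - 2 ** k >= 0:
        cells.add(t - 2 ** k)
        k += 1
    return sorted(cells)

# Each row of the attention mask now has O(log L) allowed entries.
L = 16
for t in range(L):
    print(t, logsparse_indices(t))
```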
2 Related Work
Due to the wide applications of forecasting, various methods have been proposed to solve the problem. One of the most prominent models is ARIMA [15]. Its statistical properties, as well as the well-known Box-Jenkins methodology [16] for model selection, make it a common first choice for practitioners. However, its linear assumption and limited scalability make it unsuitable for large-scale forecasting tasks. Furthermore, information across similar time series cannot be shared since each time series is fitted individually. In contrast, [17] models related time series data as a matrix and treats forecasting as a matrix factorization problem. [18] proposes hierarchical Bayesian methods to learn across multiple related count time series from a graphical-model perspective.
Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting. [3] fuses traditional AR models with RNNs by modeling a probabilistic distribution in an encoder-decoder fashion. Instead, [19] uses an RNN as an encoder and Multi-Layer Perceptrons (MLPs) as a decoder to address the so-called error-accumulation issue and conduct multi-step-ahead forecasting in parallel. [6] uses a global RNN to directly output the parameters of a linear SSM at each step for each time series, aiming to approximate nonlinear dynamics with locally linear segments. In contrast, [9] handles noise using a local Gaussian process for each time series while using a global RNN to model the shared patterns. [20] tries to combine the advantages of AR models and SSMs, and maintains a complex latent process to conduct multi-step forecasting in parallel.
The well-known self-attention based Transformer [1] has recently been proposed for sequence modeling and has achieved great success. Several recent works apply it to translation, speech, music and image generation [1, 21, 22, 23]. However, scaling attention to extremely long sequences is computationally prohibitive, since the space complexity of self-attention grows quadratically with sequence length [21]. This becomes a serious issue in forecasting time series with fine granularity and strong long-term dependencies.

3 Background
Problem definition. Suppose we have a collection of $N$ related univariate time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} \triangleq [z_{i,1}, z_{i,2}, \cdots, z_{i,t_0}]$ and $z_{i,t} \in \mathbb{R}$ denotes the value of time series $i$ at time $t$¹. We are going to predict the next $\tau$ time steps for all time series, i.e. $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^{N}$. Besides, let $\{x_{i,1:t_0+\tau}\}_{i=1}^{N}$ be a set of associated time-based covariate vectors with dimension $d$ that are assumed to be known over the entire time period, e.g. day-of-the-week and hour-of-the-day. We aim to model the following conditional distribution

$$p(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, x_{i,1:t_0+\tau}; \Phi) = \prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t}; \Phi).$$
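The product form above already dictates how multi-step forecasts are produced: predict one step, sample it, append it to the observed history, and repeat for $\tau$ steps. Below is a minimal sketch of this rollout, assuming a hypothetical `one_step_model` that returns the mean and standard deviation of a Gaussian one-step predictive distribution (an illustrative interface, not the paper's API):

```python
import numpy as np

def forecast(one_step_model, z_hist, x_all, tau, rng=None):
    """Roll a one-step-ahead model forward for tau steps.
    z_hist: observed values z_{1:t0};  x_all: covariates x_{1:t0+tau}."""
    if rng is None:
        rng = np.random.default_rng()
    z = list(z_hist)
    t0 = len(z_hist)
    for t in range(t0, t0 + tau):
        mu, sigma = one_step_model(np.array(z), x_all[: t + 1])
        z.append(rng.normal(mu, sigma))      # sample z_t and feed it back in
    return np.array(z[t0:])                  # forecasts for t0+1, ..., t0+tau

# Toy usage with a stand-in "model" that predicts a noisy random walk.
toy_model = lambda z_past, x_past: (z_past[-1], 0.1)
print(forecast(toy_model, z_hist=np.ones(5), x_all=np.zeros((8, 2)), tau=3))
```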
We reduce the problem to learning a one-step-ahead prediction model $p(z_t \mid z_{1:t-1}, x_{1:t}; \Phi)$², where $\Phi$ denotes the learnable parameters shared by all time series in the collection. To fully utilize both the observations and covariates, we concatenate them to obtain an augmented matrix as follows:

$$y_t \triangleq [z_{t-1} \circ x_t] \in \mathbb{R}^{d+1}, \qquad Y_t = [y_1, \cdots, y_t]^{T} \in \mathbb{R}^{t \times (d+1)},$$

where $[\cdot \circ \cdot]$ represents concatenation. An appropriate model $z_t \sim f(Y_t)$ is then explored to predict the distribution of $z_t$ given $Y_t$.
Transformer. We instantiate $f$ with Transformer³ by taking advantage of the multi-head self-attention mechanism, since self-attention enables Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns. These advantages make Transformer a good candidate for time series forecasting. We briefly introduce its architecture here and refer readers to [1] for more details.
In the self-attention layer, a multi-head self-attention sublayer simultaneously transforms $Y$⁴ into $H$ distinct query matrices $Q_h = Y W_h^{Q}$, key matrices $K_h = Y W_h^{K}$, and value matrices $V_h = Y W_h^{V}$, respectively, with $h = 1, \cdots, H$. Here $W_h^{Q}, W_h^{K} \in \mathbb{R}^{(d+1) \times d_k}$ and $W_h^{V} \in \mathbb{R}^{(d+1) \times d_v}$ are learnable parameters. After these linear projections, the scaled dot-product attention computes a sequence of vector outputs:

$$O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}} \cdot M\right) V_h.$$
Note that a mask matrix $M$ is applied to filter out rightward attention by setting all upper triangular elements to $-\infty$, in order to avoid future information leakage. Afterwards, $O_1, O_2, \cdots, O_H$ are concatenated and linearly projected again. Upon the attention output, a position-wise feedforward sublayer with two layers of fully-connected network and a ReLU activation in the middle is stacked.
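For reference, the masked attention above can be written in a few lines; the following is a plain single-head NumPy sketch in which the mask is imposed by setting disallowed (future) scores to $-\infty$ before the softmax, matching the role of $M$ in the formula (the multi-head version simply repeats this with separate projections and concatenates the outputs):

```python
import numpy as np

def causal_self_attention(Y, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask.
    Y: (t, d+1) augmented inputs;  Wq, Wk: (d+1, d_k);  Wv: (d+1, d_v)."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (t, t)
    t = scores.shape[0]
    scores[np.triu_indices(t, k=1)] = -np.inf          # forbid future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (t, d_v)

# Toy check: the output at step s depends only on rows 0..s of Y.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
print(causal_self_attention(Y, Wq, Wk, Wv).shape)      # (6, 3)
```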
4 Methodology
4.1 Enhancing the locality of Transformer
Patterns in time series may evolve with time significantly due to various events, e.g. holidays and
extreme weather, so whether an observed point is an anomaly, change point or part of the patterns
is highly dependent on its surrounding context. However, in the self-attention layers of canonical
Transformer, the similarities between queries and keys are computed based on their point-wise values
without fully leveraging local context like shape, as shown in Figure 1(a) and (b). Query-key matching
agnostic of local context may confuse the self-attention module in terms of whether the observed
value is an anomaly, change point or part of patterns, and bring underlying optimization issues.
We propose convolutional self-attention to ease the issue. The architectural view of the proposed convolutional self-attention is illustrated in Figure 1(c) and (d). Rather than using convolution of
¹ Here the time index $t$ is relative, i.e. the same $t$ in different time series may represent different actual time points.
² Since the model is applicable to all time series, we omit the subscript $i$ for simplicity and clarity.
³ When referring to Transformer, we only consider the autoregressive Transformer decoder in the following.
⁴ At each time step the same model is applied, so we simplify the formulation with some abuse of notation.
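Based on the description so far (causal convolutions producing queries and keys so that matching is aware of local shape), a minimal PyTorch-style sketch of the query/key generation in convolutional self-attention might look as follows; the kernel size and the zero left-padding scheme are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvQK(nn.Module):
    """Produce queries and keys with causal 1-D convolutions so that each
    query/key summarizes a local window ending at its own time step;
    values keep the usual point-wise (kernel size 1) projection."""
    def __init__(self, d_in, d_k, d_v, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv_q = nn.Conv1d(d_in, d_k, kernel_size)
        self.conv_k = nn.Conv1d(d_in, d_k, kernel_size)
        self.proj_v = nn.Linear(d_in, d_v)

    def forward(self, Y):
        # Y: (batch, t, d_in) -> Conv1d expects (batch, d_in, t).
        x = Y.transpose(1, 2)
        # Left-pad with kernel_size - 1 zeros so no position sees the future.
        x = F.pad(x, (self.kernel_size - 1, 0))
        Q = self.conv_q(x).transpose(1, 2)   # (batch, t, d_k)
        K = self.conv_k(x).transpose(1, 2)   # (batch, t, d_k)
        V = self.proj_v(Y)                   # (batch, t, d_v)
        return Q, K, V

# Q, K, V can then be fed into the masked attention sketched in Section 3.
qk = CausalConvQK(d_in=8, d_k=16, d_v=16, kernel_size=3)
Q, K, V = qk(torch.randn(2, 24, 8))
print(Q.shape, K.shape, V.shape)
```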