Copyright © 2020 調和系工学研究室 - 北海道大学 大学院情報科学研究院 情報理工学部門 複合情報工学分野 – All rights reserved.
Paper Introduction
A Transformer-based Framework for Multivariate
Time Series Representation Learning
北海道大学 大学院情報科学研究院
情報理工学部門 複合情報工学分野 調和系工学研究室
劉兆邦
June 20, 2022
• Authors
– George Zerveas, Srideepika Jayaraman, Dhaval Patel,
Anuradha Bhamidipaty, Carsten Eickhoff
• Published at
– Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining
• Paper link
– https://dl.acm.org/doi/abs/10.1145/3447548.3467401?casa_token=HbWWl3ksNy4AAAAA:watSSa0fom_EbxcyDmj8vMTSmhxjuj0XzZ5lpJYCtzSIEvwys4my5p8ksSsfSLsdfZAPpQokiQEo
Paper information
• A novel framework for multivariate time series representation learning
based on the transformer encoder architecture
• The framework includes an unsupervised pre-training scheme, which
can offer substantial performance benefits over fully supervised
learning on downstream tasks
• Performs significantly better than the best currently available methods
for regression and classification
• The first unsupervised method shown to push the limits of state-of-the-art performance for multivariate time series regression and classification
Abstract
Unlike in domains such as Computer Vision or Natural Language
Processing (NLP), the dominance of deep learning for time series
is far from established
Non-deep learning methods such as TS-CHIEF, HIVE-COTE, and ROCKET
currently hold the record on time series regression and classification
dataset benchmarks
Transformer models are based on a multi-headed attention mechanism
that renders them particularly suitable for time series data
Develop a generally applicable methodology (framework) that can
leverage unlabeled data by first training a transformer encoder to extract
dense vector representations of multivariate time series through an input
“denoising” (autoregressive) objective.
Introduction
Methodology-Base model
The base model uses only the encoder of the original transformer [1]; the decoder module needs the (masked) “ground truth” output sequence as an input, and is thus unsuitable for tasks such as classification or (extrinsic) regression.
[1] Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
[Figure: the original transformer encoder-decoder architecture from [1]; this work uses only the encoder]
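As a reading aid, a minimal PyTorch sketch of such an encoder-only base model (my own illustration, not the authors' code; the layer sizes are placeholders, and details such as the normalization scheme may differ from the paper):

```python
import torch.nn as nn

# Encoder-only base model: a stack of standard transformer encoder layers.
# d_model, nhead, dim_feedforward and num_layers are placeholder values.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=128, nhead=8, dim_feedforward=256, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
# z = encoder(u)   # u: (batch, w, d_model) embedded input; z: same shape
```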
Methodology-Base model
[Figure: model input. Each time series has length w and m variables; each time step is mapped to the model dimension by a linear transformation or, alternatively, by a (1D) convolution]
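A minimal sketch of this input stage (PyTorch; my own illustration under the shapes named in the figure, not the authors' code):

```python
import torch
import torch.nn as nn

class TSEmbedding(nn.Module):
    """Map a (batch, w, m) multivariate series into the model dimension and
    add learnable positional encodings (discussed on the next slide)."""
    def __init__(self, m: int, d_model: int, w: int):
        super().__init__()
        self.proj = nn.Linear(m, d_model)  # the "linear transformation" option
        # The "convolution" option would instead be, e.g.,
        # nn.Conv1d(m, d_model, kernel_size=3, padding=1) on x.transpose(1, 2).
        self.pos = nn.Parameter(0.02 * torch.randn(w, d_model))  # learnable PE

    def forward(self, x):               # x: (batch, w, m)
        return self.proj(x) + self.pos  # positional encodings broadcast over batch
```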
Methodology-Base model
Positional encodings
The positional encodings are fully learnable; based on the performance of our models, we also observe that they generally do not appear to interfere significantly with the numerical information of the time series.
Padding
• After setting a maximum sequence length 𝑤 for the entire dataset, shorter
samples are padded with arbitrary values
• Generate a padding mask which adds a large negative value to the attention scores for the padded positions, before computing the self-attention distribution with the softmax function (sketched below)
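A hedged sketch of this padding-mask step (my own illustration; the tensor shapes are assumptions, not the authors' code):

```python
import torch

def attention_weights_with_padding(scores, lengths):
    """scores: (batch, n_heads, w, w) raw attention scores;
    lengths: (batch,) true length of each padded sequence.
    Padded positions get a large negative score before the softmax,
    so they receive (near-)zero attention weight."""
    w = scores.size(-1)
    pos = torch.arange(w, device=scores.device)        # (w,)
    pad = pos.unsqueeze(0) >= lengths.unsqueeze(1)     # (batch, w); True = padded
    scores = scores.masked_fill(pad[:, None, None, :], -1e9)
    return torch.softmax(scores, dim=-1)
```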
Methodology-Regression and classification
The final output vectors z are concatenated into a single vector and fed into a linear output layer, which adapts the model to regression and classification tasks (sketched below).
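A minimal sketch of this output head (my own illustration; out_dim would be the number of regression targets or classes):

```python
import torch.nn as nn

class OutputHead(nn.Module):
    """Concatenate the w encoder output vectors z_t into one (w * d_model)
    vector and apply a single linear layer."""
    def __init__(self, d_model: int, w: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(d_model * w, out_dim)

    def forward(self, z):                             # z: (batch, w, d_model)
        return self.linear(z.reshape(z.size(0), -1))  # (batch, out_dim)
```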
Methodology-Unsupervised pre-training
We set part of the input to 0 and ask the
model to predict the masked values
A binary noise mask is created independently for each training sample and epoch, and the input is masked by elementwise multiplication:
[Figure: masking scheme. In each row (variable), the masked (0) segments and the unmasked (1) segments each have a characteristic mean length; with masking ratio r, on average r · m variables are masked in each column (time step); see the sketch below]
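A sketch of a mask generator consistent with these annotations (my own NumPy illustration of the scheme described in the paper; lm = 3 and r = 0.15 are the default values reported there):

```python
import numpy as np

def geometric_mask(w: int, m: int, lm: float = 3.0, r: float = 0.15):
    """Binary noise mask for one (w, m) sample: 0 = masked, 1 = kept.
    Each variable (column) is masked independently; masked segments have
    geometrically distributed length with mean lm, unmasked segments have
    mean lm * (1 - r) / r, so a proportion r of each column is masked on
    average."""
    lu = lm * (1.0 - r) / r
    mask = np.ones((w, m), dtype=np.float32)
    for j in range(m):
        t = 0
        masked = np.random.rand() < r            # start masked with prob. r
        while t < w:
            seg = np.random.geometric(1.0 / (lm if masked else lu))
            if masked:
                mask[t:t + seg, j] = 0.0
            t += seg
            masked = not masked
    return mask

# x_masked = x * geometric_mask(w, m)   # elementwise multiplication (this slide)
```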
Methodology-Unsupervised pre-training
We chose this masking pattern because it encourages the model to learn to attend
both to preceding and succeeding segments in individual variables, as well as to
existing contemporary values of the other variables in the time series, and thereby
to learn to model inter-dependencies between variables.
The loss is computed only on the masked parts of the input (sketched below).
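A minimal sketch of such a masked reconstruction loss (my own illustration; mean squared error restricted to the masked positions):

```python
import torch

def masked_mse(x_hat, x, mask):
    """x_hat, x, mask: (batch, w, m); mask == 0 marks masked positions.
    Only those positions contribute to the loss."""
    masked = mask == 0
    return ((x_hat - x)[masked] ** 2).mean()
```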
Experiments & Results-Regression
TST (Time Series Transformer)
• proposed approach achieves an average rank of 1.33
• pre-trained transformer models outperform the fully
supervised ones in 3 out of 6 datasets
‒ no additional samples are used for pretraining
average relative difference from mean
Lower values indicate better average performance (difference from the mean RMSE).
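For reference, a small sketch of how this metric can be computed (my interpretation of "average relative difference from mean", not the authors' code):

```python
import numpy as np

def avg_rel_diff_from_mean(rmse):
    """rmse: (n_methods, n_datasets). For each dataset, compare each method's
    RMSE to the mean RMSE over all methods, then average over datasets;
    lower (more negative) values are better."""
    dataset_mean = rmse.mean(axis=0, keepdims=True)   # (1, n_datasets)
    return ((rmse - dataset_mean) / dataset_mean).mean(axis=1)
```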
Experiments & Results-Regression
Q1: Given a partially labeled dataset of a certain size, how will additional
labels affect performance?
• As expected, performance improves with an increasing proportion of available labels, both for a fully supervised model and for the same model first pre-trained on the entire training set through the unsupervised objective and then fine-tuned
• not only does the pretrained model outperform the fully supervised one, but the
benefit persists throughout the entire range of label availability, even when the
models are allowed to use all labels
Experiments & Results-Regression
Q2: Given a labeled dataset, how will additional unlabeled samples
affect performance?
• for a given number of labels (shown as a percentage of the totally available labels),
the more data samples are used for unsupervised learning, the lower the error
achieved
• reusing a subset of the same samples for unsupervised pretraining improves
performance
[Figure legend: "fully supervised training only" marks the baseline without unsupervised pre-training]
Experiments & Results-Classification
• performed best on 7 out of the 11 datasets, achieving an average rank of 1.7
• We believe that this indicates a relative weakness of our current models when dealing with very low-dimensional time series (3-dimensional)
• Finally, we observe that the pre-trained transformer models performed better than the fully supervised ones in 8 out of 11 datasets, sometimes by a substantial margin
‒ suggesting that the benefit originates from merely reusing the same samples in a different training task
Additional points
Execution time on Tesla P100 GPU
In practice, despite allowing for many hundreds of epochs, we never trained our models on a GPU for longer than 3 hours on any of the examined datasets.
Conclusion
➢ Propose a transformer-based framework for unsupervised representation
learning of multivariate time series
➢ Demonstrates, for the first time, that unsupervised learning of multivariate time series can surpass the performance of current state-of-the-art supervised methods
➢ Unsupervised pre-training of our transformer models offers a substantial performance benefit over fully supervised learning, even without leveraging additional unlabeled data
➢ The proposed framework can be readily used for additional downstream tasks, such as forecasting, clustering and missing value imputation