1 Introduction

Time-series forecasting is an important part of the decision-making process in a wide range of application domains such as low-carbon electricity (Rolnick et al., 2019; Xu & Xie, 2021), stock price predictions, service demand forecasting, and medical prognoses (Stankevičiūtė et al., 2021). While modern time-series forecasting techniques have become increasingly accurate (Hossain & Mahmood, 2020; Lin et al., 2021), they usually produce point forecasts, providing little to no information about the uncertainty associated with their predictions. This limited quantification of the uncertainty constrains decision-making in practice, especially in high-stakes situations.

Recently, conformal prediction has been proposed for the uncertainty quantification of models' predictions (Vovk et al., 2005; Barber et al., 2022). It is a modern, model-agnostic framework capable of providing valid prediction sets or intervals without making assumptions about the distribution of the data. Conformal prediction, however, relies on two assumptions that are generally not met in time-series forecasting tasks: it assumes that the examples in the data set are exchangeable and that the model fitting algorithm for the underlying model treats the examples symmetrically, i.e. that the algorithm produces the same model independently of the order of the examples in the training set.

Multiple attempts (Gibbs & Candès, 2021; Barber et al., 2022; Chernozhukov et al., 2018; Feldman et al., 2022; Stankevičiūtė et al., 2021) have been made to lift these requirements imposed by conformal prediction. While some of these techniques allow for non-exchangeable data and asymmetrical model-fitting algorithms, their applicability is generally restricted because they are limited to univariate, single-step-ahead time-series forecasting tasks.

Motivated by Rolnick et al. (2019), we set out to analyze if and how conformal prediction can be extended to the more complex multistep-ahead multivariate time-series forecasting tasks. The work of Stankevičiūtė et al. (2021) provides an approach that allows for the construction of valid prediction regions for multi-target regression by combining multiple conformal predictors. However, its application to time-series is limited by the exchangeability requirement. Barber et al. (2022) propose a method that is specifically designed to overcome the requirements of the original conformal predictor and render it applicable to time-series forecasting tasks. These properties are achieved under the assumption that the method’s ideal parameters are known or can be estimated well enough. While powerful, their method is bound to single-target regression. We propose a new method, non-exchangeable multi-target conformal prediction (nmtCP), that is capable of producing robust prediction regions for multistep-ahead multivariate time-series forecasting tasks. nmtCP achieves this by combining the approaches taken by Barber et al. (2022) and Stankevičiūtė et al. (2021). This article describes the general concept of nmtCP, shows how it is implemented in practice, offers a theoretical analysis of its validity, and compares it to existing methods.

The development of nmtCP is preceded by an extensive literature review that spans from the standard conformal prediction method to the most recent adaptations for the time-series domain, describing their respective contributions. This includes a discussion of the difficulties that arise when applying conformal prediction to multi-target regression and some solutions proposed by other researchers in the field.

Following the theoretical analysis of nmtCP, a series of experiments validate its theoretical properties on four synthetic and two real-world data sets. This also serves to compare nmtCP to another state-of-the-art method and to discuss its behavior and performance. To conduct the practical experiments, we implement the method as one step in a modular pipeline that can easily accommodate different underlying models.

The contributions of this article are summarized as follows:

  • An in-depth literature review of related works. It provides a comparison of existing methods with regard to their applicability to non-exchangeable data and to multi-target regression in the context of time-series forecasting.

  • A new method, nmtCP, to generate prediction regions for multistep-ahead multivariate time-series forecasting tasks.

  • A theoretical proof of the validity of nmtCP.

  • Systematic experiments applying nmtCP that confirm the theoretical results.

While a subset of these results was included in Schlembach et al. (2022), that publication lacks the context provided by the literature review and the discussion and, crucially, the theoretical proof provided in this article.

The remainder of this article is structured as follows. First, Sect. 2 introduces conformal prediction and discusses its subsequent extensions related to time-series forecasting and multi-target regression tasks. Next, Sect. 3 describes nmtCP and its theoretical properties. Subsequently, in Sect. 4, nmtCP’s theoretical claims are tested experimentally on simulated and real-world data sets. Finally, Sect. 5 gives the conclusions and provides an outlook for future research.

2 Background and related work

This section provides the necessary background and reviews the relevant related work. Subsect. 2.1 introduces (inductive) conformal prediction by Vovk et al. (2005) and the learning settings in which it can be used. This is followed by a description of the extensions of ICPs for time-series forecasting in Subsect. 2.2. Subsection 2.3 introduces recent methods for conformal multi-target regression. Further analyses and discussions are provided in Subsection 2.4.

2.1 Conformal prediction

The upper bounds on the probability of error of machine learning algorithms’ predictions provided by statistical learning theory are generally too loose to be useful in practice (Vovk et al., 2005, p.4). This is why Vovk et al. (2005) set out to quantify the confidence of machine learning algorithms’ outputs beyond what statistical learning theory provides. Their idea is to use a model’s performance on previous samples, as well as information about the current sample, to estimate the confidence with which the label of a new object is predicted. Conformal prediction, the resulting method, can be used on top of almost any machine-learning algorithm while retaining its theoretical guarantees (Vovk et al., 2005) under minimal assumptions (Toccaceli, 2022). It usually only requires that the examples in the training set are exchangeable and that the underlying model treats them symmetrically (Barber et al., 2022). Conformal prediction also requires minimal to no modifications of the underlying model (Stankevičiūtė et al., 2021). In the next two subsections, we provide a detailed description of the computationally efficient conformal predictors used in the method presented in this article.

2.1.1 Inductive conformal predictors

Let \(\mathcal {X}\) be a measurable space of objects and \(\mathcal {Y}\) be a measurable space of labels [cf. Vovk et al. (2005)]. The example space \(\mathcal {Z}\) is defined as the Cartesian product \(\mathcal {X} \times \mathcal {Y}\). A sequence \(z_{1:n}\) of n training examples \(z_1, \dots , z_n\) is generated by an unknown probability distribution Q defined over \(\mathcal {Z}\). We note that no assumptions are made about Q beyond the exchangeability assumption. In this context we can train two types of predictors: simple predictors and confidence predictors.

A simple predictor h is defined as \(h: \mathcal {Z}^{(n)} \times \mathcal {X} \rightarrow \mathcal {Y}\) where \(\mathcal {Z}^{(n)}\) is the set of all bags of size n of elements from \(\mathcal {Z}\). Given a training bag and an object \(x_{n+1} \in \mathcal {X}\), the predictor h assigns a label \(\hat{y}_{n+1} \in \mathcal {Y}\) to \(x_{n+1}\) (Vovk et al., 2005, p.18). In contrast, \(\Gamma : \mathcal {Z}^{(n)} \times \mathcal {X} \times (0, 1) \rightarrow 2^\mathcal {Y}\) is a confidence predictor. Given a training bag, an object \(x_{n+1} \in \mathcal {X}\), and a significance level \(\alpha \in (0, 1)\), the predictor \(\Gamma\) assigns a region \(\hat{\textbf{y}}_{n+1}\) of \(\mathcal {Y}\) to \(x_{n+1}\) (Vovk et al., 2005, p.19).

A confidence predictor \(\Gamma\) is said to be exactly valid or exact if for any example \(z_{n+1} = (x_{n+1}, y_{n+1})\) the error probability \(\textsf{P}(y_{n+1} \notin \hat{\textbf{y}}_{n+1}) = \alpha\), i.e. the probability of the true label \(y_{n+1}\) not being included in the prediction region \(\hat{\textbf{y}}_{n+1}\) equals the significance level \(\alpha\) (Vovk et al., 2005, p.20). If \(\textsf{P}(y_{n+1} \notin \hat{\textbf{y}}_{n+1}) \le \alpha\), the confidence predictor \(\Gamma\) is said to be conservatively valid (Vovk et al., 2005, p.20). Vovk et al. (2005, p.21) show that no confidence predictor achieves exact validity.

In addition to being valid, confidence predictors need to be informationally efficient. A confidence predictor \(\Gamma\) is said to be informationally efficient if the prediction regions \(\hat{\textbf{y}}_{n+1} \subseteq \mathcal {Y}\) are non-empty and as small as possible (Vovk et al., 2005, p.20).

Inductive conformal predictors (ICPs) are conservatively valid confidence predictors. They are usually a preferred choice due to their computational efficiency. This is achieved by splitting the training data into two disjoint sets (bags): a proper training set \(\mathcal {D}_{train}\) of size \(m_{0}\) and a calibration set \(\mathcal {D}_{cal}\) of size \(m_{c} = n-m_{0}\) (Toccaceli, 2022). The proper training set \(\mathcal {D}_{train}\) is used to train a nonconformity measure (NCM) \(A: \mathcal {Z}^{(m_0)} \times \mathcal {Z} \rightarrow \bar{\mathbb {R}}\). A is a measurable function that estimates how different any example in \(\mathcal {D}_{cal}\) is from the examples in \(\mathcal {D}_{train}\) (Vovk et al., 2005).

A common class of nonconformity measures is based on a simple predictor \(\hat{\mu } = h(\mathcal {D}_{train})\). An example of such a nonconformity measure A is the absolute residual:

$$\begin{aligned} r_i = {\left\{ \begin{array}{ll} |y_i - \hat{\mu }(x_i)| & \forall i = m_{0}+1, \dots , n \\ |\bar{y}_{n+1} - \hat{\mu }(x_{n+1})| & i = n+1 \end{array}\right. } \end{aligned}$$
(1)

Once the residuals have been computed for the calibration set, an ICP determined by the NCM (1) produces the prediction region:

$$\begin{aligned} \hat{\textbf{y}}_{n+1} = \Gamma ^\alpha (\mathcal {D}, x_{n+1}) = \hat{\mu }(x_{n+1}) \pm \textsf{Q}_{1-\alpha }\left( \sum _{i=m_{0}+1}^{n} \tfrac{1}{m_{c}+1} \cdot \delta _{r_{i}} + \tfrac{1}{m_{c}+1} \cdot \delta _{+\infty } \right) \end{aligned}$$
(2)

where \(\textsf{Q}_{1-\alpha }\) denotes the quantile function and \(\delta _x\) the Kronecker delta function. \(\textsf{Q}_{1-\alpha }\) returns the \(\lceil (1 - \alpha )(m_{c}+1) \rceil\)th smallest value of the residuals \(r_{m_{0}+1:n}\) (Barber et al., 2022; Stankevičiūtė et al., 2021).

We note that, in contrast to other types of conformal predictors, ICPs split the training data \(\mathcal {D}\) and use less data for training the nonconformity measure as well as for computing the prediction set. Thus, ICPs are relatively less informationally efficient; i.e. they trade informational efficiency for computational efficiency.
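To make Eq. (2) concrete, the following is a minimal sketch of the interval computation, assuming a hypothetical fitted point predictor `mu` and the absolute residuals of Eq. (1) computed on the calibration set; it illustrates the formula rather than serving as a reference implementation.

```python
import numpy as np

def icp_interval(mu, x_new, residuals, alpha):
    """ICP prediction interval of Eq. (2) for a single new object."""
    m_c = len(residuals)
    k = int(np.ceil((1 - alpha) * (m_c + 1)))  # quantile index from Eq. (2)
    if k > m_c:
        # the (1 - alpha) quantile falls on the point mass at +infinity
        return -np.inf, np.inf
    q = np.sort(residuals)[k - 1]              # k-th smallest calibration residual
    y_hat = mu(x_new)
    return y_hat - q, y_hat + q
```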

For a more in-depth overview of conformal prediction, please refer to (Fontana et al., 2023; Toccaceli, 2022; Vovk et al., 2005). An excellent hands-on introduction to conformal prediction is provided by Angelopoulos and Bates (2023).

2.1.2 On-line, semi-on-line, semi-off-line and off-line settings

ICPs can be used in on-line, semi-on-line, semi-off-line and off-line settings. This subsection outlines how these situations are handled in practice.

In an on-line setting, nature provides objects \(x_i\) sequentially and their corresponding labels \(y_i\) delayed by one time step (Vovk et al., 2005, p.5). This means that at the \(n_{0}+1\)st step, when the examples \(z_{1:n_{0}}\) and the object \(x_{n_{0}+1}\) are known, an ICP as defined in Eq. (2) can generate a prediction region \(\hat{\textbf{y}}_{n_{0}+1} = \Gamma ^\alpha (\mathcal {D}, x_{n_{0}+1})\) (Vovk et al., 2005, p.5). Then, the true label \(y_{n_{0}+1}\) and a new object \(x_{n_{0}+2}\) are revealed, and the ICP can make the prediction \(\hat{\textbf{y}}_{n_{0}+2}\), and so on (Vovk et al., 2005, p.6). As a consequence, the training data \(\mathcal {D}\) grows at every time step. In this setting, the underlying model is re-trained at every time step on a new split of the updated training data \(\mathcal {D}\), as shown in Algorithm 1.

Algorithm 1 (figure): On-line inductive conformal prediction for every new time step.

ICPs in the on-line setting, such as the one presented in Algorithm 1, require the underlying model h to be fitted for every new object that is encountered. To fully utilize the computational improvements provided by ICPs, Vovk et al. (2005, p.98) propose to use a sequence of numbers \(m_{1}< m_{2} < \dots\) with \(m_{i} \in \mathbb {N}\). These numbers \(m_{i}\) act as thresholds that determine after how many new observations the underlying model is fitted to an updated training set. Once the number of known examples n surpasses a previously unsurpassed \(m_k\), such that \(m_{k} < n \le m_{k+1}\), the underlying model is fitted to the proper training examples \(z_{1:m_{k}}\). The remaining examples \(z_{m_{k}+1:n}\) are used as the calibration set \(\mathcal {D}_{cal}\) to compute the residuals (Vovk et al., 2005, p.99). Moving forward, we call this situation the semi-on-line setting.

Vovk et al. (2005, p.110) call the semi-on-line setting with only a single threshold \(m_{1}\) the semi-off-line ICP. The underlying model is only trained once, while nature reveals new objects sequentially and their true labels with a delay. Once a label becomes known, it, together with its associated object, is added to the calibration set. This procedure ensures that the method retains the validity guarantees (Vovk et al., 2005, p.111).

Finally, in the off-line setting, nature does not provide the objects and labels sequentially. Instead, it provides all the labeled data and a set of objects \(x_{n+1:n+m_{u}}\) with unknown labels. In this situation, the underlying model is fitted only once to the proper training set \(\mathcal {D}_{train}\) and the ICP is calibrated on the calibration set \(\mathcal {D}_{cal}\). Then the resulting ICP generates prediction regions for all \(x_{n+1:n+m_{u}}\). We note that this setup weakens the validity guarantee. While \(\textsf{P}(y_i \notin \hat{\textbf{y}}_{i}) \le \alpha , \ \forall i \in \{n+1, \dots , n+m_{u}\}\) still holds, the errors are no longer independent and \(\sum _{i=n+1}^{n+m_{u}} \text {err}_i / m_{u} \le \alpha\) may no longer be true (Vovk et al., 2005, p.112).

Note that we make the distinction between the on-line setting, the semi-on-line setting, the semi-off-line and the off-line setting based on the number of times the underlying model is fitted to a set of examples and the points in time when that happens. This classification is different from the one used by Toccaceli (2022) who uses batch mode of operation to denote what we call the off-line setting.

To summarize, we describe the different settings using the following terms:

  • On-line setting: the underlying model is fitted for every new example that nature provides with a new training and calibration split;

  • Semi-on-line: the underlying model is fitted periodically, after \(m_1\), \(m_2\), \(\dots\) examples, and the calibration set is updated with every new observed label;

  • Semi-off-line: the underlying model is only trained once and the calibration set is updated with every new observed label (a sketch of this setting follows this list);

  • Off-line: the underlying model is only trained once and the calibration set is not updated.
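The following sketch illustrates the semi-off-line setting, the variant most relevant for the method presented in Sect. 3; `fit` is a hypothetical training routine returning a point predictor, and the quantile computation follows Eq. (2).

```python
import numpy as np

def semi_off_line(fit, initial, stream, m0, alpha=0.1):
    """`initial` holds the first n known (x, y) pairs; `stream` yields later
    pairs one at a time, with the label revealed after the prediction."""
    train, cal = initial[:m0], initial[m0:]     # split along the time axis
    mu = fit(train)                             # the model is trained exactly once
    residuals = [abs(y - mu(x)) for x, y in cal]
    for x, y in stream:
        m_c = len(residuals)
        k = int(np.ceil((1 - alpha) * (m_c + 1)))
        q = np.inf if k > m_c else float(np.sort(residuals)[k - 1])
        yield mu(x) - q, mu(x) + q              # prediction interval for x
        residuals.append(abs(y - mu(x)))        # the new label grows the calibration set
```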

2.2 Conformal time-series forecasting

The definition of ICPs in an on-line setting provided in Eq. (2) and the subsequent variations lose their validity guarantee if applied to time-series. This is because the observations in time-series are generally not exchangeable. In this subsection, we present the work of researchers who have sought to overcome this limitation. We group the presented methods broadly by the approach the authors took. This is preceded by an introduction to the notation we use for the representation of time-series data.

Going forward we use the following notation. \(o_{t,j} \in \mathbb {R}\) denotes the value of feature j at time step t. \(o_t = o_{t,1:F} \in \mathbb {R}^F\) represents the vector of observations for all F features at time step t and \(o_{1:T} \in \mathbb {R}^{T \times F}\) stands for the entire multivariate time-series with T time steps and F features. In the case where only a single feature is observed, the second index is omitted and \(o_{1:T} \in \mathbb {R}^T\) represents a univariate time-series with T time steps.

A single-step ahead time-series forecast predicts \(y = o_{T+1}\) given a univariate time-series \(x = o_{1:T}\). If \(x = o_{1:T}\) is a multivariate time-series, the single-step ahead forecast predicts some or all features of \(y = o_{T+1,1:F}\). In contrast, a multi-step ahead time-series forecast (called multi-horizon time-series forecast by some authors (Stankevičiūtė et al., 2021)) predicts the values \(y = o_{T+1:T+T_h}\) for multiple future time steps in the univariate setting. In the multivariate setting, it predicts the value for some or all features of these future time steps \(y = o_{T+1:T+T_h}\). The number of predicted future time steps \(T_h\) is called the prediction horizon.

2.2.1 Methods relying on block-wise exchangeability

Chernozhukov et al. (2018) and Stankevičiūtė et al. (2021) soften the exchangeability condition by requiring the data to be block-wise exchangeable. Chernozhukov et al. (2018) first split the time-series into blocks containing an equal number of consecutive observations. These blocks are permuted while the values within each block retain their order, preserving the dependence structure of the time-series. They treat the elements between the last known observation and the prediction horizon as one such block. The authors prove that their method remains approximately valid under weak conditions on the nonconformity score, relaxing the exchangeability condition on the data. However, the method requires an infinite number of hypothetical labels \(\bar{y}\) to be tested for regression tasks, and it has not been developed further to benefit from the improvements in computational efficiency presented in Subsect. 2.1.2. For exact validity, their method requires the examples to be exchangeable under the permutations \(\pi\), a condition which might not be met for time-series with long-term dependencies surpassing the block size.

Stankevičiūtė et al. (2021) base their CF-RNN method on blocks of consecutive observations that do not belong to the same time-series but are drawn from the same distribution. This makes it impractical for a single multivariate time-series. Subsection 2.3 contains a more detailed discussion of CF-RNN.

2.2.2 Adaptive methods

Instead of constraining the observations, Xu and Xie (2021) make assumptions about the stochastic errors of the time-series forecasts and the estimation quality of the underlying models to build distribution-free prediction intervals with an approximate coverage guarantee. In practice, they create B bags \(\mathcal {D}_b\) of size n by sampling with replacement from the known examples \(z_{1:n}\). Next, they fit B underlying models \(\hat{\mu }_{b} = h_{\mathcal {D}_b}, \forall b \in \{1, \dots , B\}\). The residuals

$$\begin{aligned} r_{i} = |y_{i} - \phi (\{\hat{\mu }_{b}(x_i)\ \forall b: z_i \notin \mathcal {D}_b\})| \end{aligned}$$
(3)

are the absolute differences between the real label \(y_i\) and the aggregated predictions of all underlying models that were not trained on \(z_i\), where \(\phi\) is the chosen aggregation function. Finally, they produce the prediction interval

$$\begin{aligned} \hat{\textbf{y}}_{n+1} = \textsf{Q}_{1 - \alpha }\left( \left\{ \phi (\{\hat{\mu }_{b}(x_{n+1})\ \forall b: z_i \notin \mathcal {D}_b\}) \right\} _{i=1}^{n} \right) \pm \textsf{Q}_{1 - \alpha } ( r_{1:n}) \end{aligned}$$
(4)

for a new object \(x_{n+1}\). In addition, they implement a mechanism to add the residual \(r_{n+1}\) when the label for a new object becomes available and to remove the oldest residual. This sliding window applied to the residuals used in the quantile function in Eq. (4) allows the method to adapt over time. Xu and Xie (2021) prove that EnbPI produces valid prediction intervals under two assumptions. The first assumption is that the residuals \(r_{1:n}\) are stationary and strongly mixing. The second assumption states that the average difference between the residuals and the stochastic part of the data generating process decreases over time. Xu and Xie (2021) test their EnbPI method on single-step ahead and multi-step ahead solar power prediction tasks. They compare it to the ARIMA method, which does not retain valid coverage in their experiments.
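The following condenses Eqs. (3) and (4) into a sketch. The training routine `fit` (returning a callable predictor), NumPy arrays `X` and `y` for the known examples, and the aggregation function `phi` (here the mean) are assumptions, and the residual sliding window is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def enbpi_interval(fit, X, y, x_new, B=20, alpha=0.1, phi=np.mean):
    """Leave-one-out ensemble residuals (Eq. (3)) and interval (Eq. (4))."""
    n = len(y)
    bags = [rng.integers(0, n, size=n) for _ in range(B)]   # bootstrap bags D_b
    models = [fit(X[idx], y[idx]) for idx in bags]
    in_bag = [set(idx.tolist()) for idx in bags]
    centers, residuals = [], []
    for i in range(n):
        out = [m for m, bag in zip(models, in_bag) if i not in bag]
        if not out:                         # z_i landed in every bag; skip it
            continue
        residuals.append(abs(y[i] - phi([m(X[i]) for m in out])))   # Eq. (3)
        centers.append(phi([m(x_new) for m in out]))
    q_c = np.quantile(centers, 1 - alpha)   # quantile of the LOO centers, Eq. (4)
    q_r = np.quantile(residuals, 1 - alpha)
    return q_c - q_r, q_c + q_r
```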

2.2.3 Weighted methods

Gibbs and Candès (2021) construct their adaptive conformal inference (ACI) method explicitly with a data distribution that shifts over time in mind, moving away from the exchangeability assumption. They do this by introducing two additions. The first addition is increasing or decreasing the significance level \(\alpha\) over time based on ACI’s performance, making it an adaptive method. The second addition is weighting the nonconformity scores in the quantile function, laying the foundation for other weighted methods. Although Gibbs and Candès (2021) describe ACI in more general terms, we only show the situation where an underlying model is used to produce point forecasts in the semi-off-line setting, to keep the notation consistent. In this situation, initially n examples \(z_{1:n}\) are known. We split them into two disjoint sets, a proper training set of size \(m_{0}\) and a calibration set of size \(m_{c} = n-m_{0}\). After fitting the underlying model \(\hat{\mu } = h_{\mathcal {D}_{train}}\) to the training set, we use it to compute the nonconformity scores \(r_{m_{0}+1:n}\) on the calibration set. For future objects \(x_{n+i}\) that nature provides sequentially, ACI computes the prediction interval

$$\begin{aligned} \hat{\textbf{y}}_{n+i} = \hat{\mu }(x_{n+i}) \pm \textsf{Q}_{1-\alpha _{n+i}}\left( \sum _{j=m_{0}+1}^{n+i-1} \tfrac{1}{m_{c}+i} \cdot \delta _{r_{j}} + \tfrac{1}{m_{c}+i} \cdot \delta _{+\infty} \right) \end{aligned}$$
(5)

with

$$\begin{aligned} \alpha _{n+i} = \alpha _{n+i-1} + \gamma (\alpha - \text {err}_{n+i-1}) \end{aligned}$$
(6)

starting at \(\alpha _{n+1} = \alpha\). An alternative version of the method,

$$\begin{aligned} \hat{\textbf{y}}_{n+i} = \hat{\mu }(x_{n+i}) \pm \textsf{Q}_{1-\alpha _{n+i}}\left( \sum _{j=m_{0}+1}^{n+i-1} \tfrac{\tilde{w}_j}{m_{c}+i} \cdot \delta _{r_{j}} + \tfrac{\tilde{w}_{n+i}}{m_{c}+i} \cdot \delta _{+\infty} \right) \end{aligned}$$
(7)

which Gibbs and Candès (2021) present under the same name, introduces a series of increasing weights \(\tilde{w}_j\) on the nonconformity scores \(r_{j}\) with \(\sum \tilde{w}_j = 1\). It uses a weighted quantile function, giving more recent nonconformity scores a higher weight compared to older ones. In both cases, the step size parameter \(\gamma\) determines how quickly the method adapts to shifts in distribution. Gibbs and Candès (2021) define the

$$\begin{aligned} \text {coverage gap} = (1 - \alpha ) - \textsf{P} \left\{ y_{n+i} \in \hat{\textbf{y}}_{n+i} \right\} \end{aligned}$$
(8)

as the loss in coverage compared to what would be achieved if the examples were exchangeable. They show that by correctly calibrating \(\alpha _{n+i}\), the coverage gap can be made arbitrarily small. In addition, they prove that using the on-line update scheme described in Eq. (6), ACI is guaranteed to be asymptotically valid. This stems from the observation that \(\forall m_1 \in \mathbb {N}\)

$$\begin{aligned} \left| \frac{1}{m_1} \sum _{i=n+1}^{n+m_1} \text {err}_i - \alpha \right| \le \frac{\text {max} \{ \alpha _{n+1}, 1 - \alpha _{n+1} \} + \gamma }{m_1 \gamma } \end{aligned}$$
(9)

by taking \(\lim _{m_1 \rightarrow +\infty }\) (Gibbs & Candès, 2021). To avoid discontinuity and to prove these theoretical guarantees, the quantile functions in Eqs. (5) and (7) use the Dirac delta instead of the Kronecker delta introduced in Eq. (2).
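As an illustration, the following sketch performs one step of the unweighted variant of ACI (Eqs. (5) and (6)); `mu` is a hypothetical fitted point predictor, `err_prev` is the coverage error indicator of the previous step, and the finite empirical quantile stands in for the Dirac-delta formulation above.

```python
import numpy as np

def aci_step(mu, x_new, residuals, alpha_t, alpha, gamma, err_prev):
    """Update the significance level (Eq. (6)), then build the interval (Eq. (5))."""
    alpha_t = alpha_t + gamma * (alpha - err_prev)   # Eq. (6)
    m = len(residuals)
    k = int(np.ceil((1 - alpha_t) * (m + 1)))
    if k > m:                    # quantile hits the +infinity point mass
        return alpha_t, (-np.inf, np.inf)
    if k < 1:                    # alpha_t >= 1 yields the empty interval
        return alpha_t, None
    q = np.sort(residuals)[k - 1]
    y_hat = mu(x_new)
    return alpha_t, (y_hat - q, y_hat + q)
```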

Zaffran et al. (2022) provide an extensive analysis of the validity, the efficiency and the general behavior of ACI in various situations. They argue that forecasting of dependent examples, even in the absence of distribution shifts, can benefit from the theoretical framework created by Gibbs and Candès (2021). In addition, they propose two methods to automatically select a value for the learning rate \(\gamma\). The first consists of computing the quantiles for \(K \in \mathbb {N}\) different values of \(\gamma\) and, for every new object, choosing the one that has produced the smallest intervals in the past while retaining validity. The second is called Online Expert Aggregation on ACI (AgACI). Instead of just choosing the learning rate \(\gamma\) that performed best in the past, the intervals for different values of \(\gamma\) are aggregated using an on-line aggregation rule. Zaffran et al. (2022) compare their method to ACI, EnbPI and the semi-off-line ICP. They show that simply choosing the value of \(\gamma\) that was most efficient in the past does not lead to valid results. In their experiments, EnbPI and the semi-off-line ICP lose validity, while AgACI is more efficient than ACI and (almost) retains validity.

Feldman et al. (2022) present a method they call rolling conformal inference (Rolling CI) that aims to avoid splitting the known examples into a proper training set and a calibration set, as is done for ICP. It builds directly upon ACI by removing the calibration set and adapting the interval width in a fully on-line manner. Assuming n known examples \(z_{1:n}\), Rolling CI uses all of them to fit the underlying model \(\hat{\mu }_{n+1} = h_{z_{1:n}}\). It then constructs the prediction region

$$\begin{aligned} \hat{\textbf{y}}_{n+i} = f (x_{n+i}, \theta _{n+i}, \hat{\mu }_{n+i}) \end{aligned}$$
(10)

for every new object \(x_{n+i}\) that is observed sequentially using an interval construction function f that takes the new object \(x_{n+i}\), a calibration parameter \(\theta _{n+i}\) and the current underlying model \(\hat{\mu }_{n+i}\) as inputs. The calibration parameter

$$\begin{aligned} \theta _{n+i+1} = \theta _{n+i} + \gamma (\text {err}_{n+i} - \alpha ) \end{aligned}$$
(11)

is updated after the label \(y_{n+i}\) is revealed. This calibration parameter \(\theta _{n+i}\) controls the size of the predicted region, similar to \(\alpha _{n+i}\) in the ACI method. Finally, Rolling CI updates the underlying model to \(\hat{\mu }_{n+i+1}\) using a single gradient step, if the model supports it. Feldman et al. (2022) show that for Rolling CI to produce valid intervals according to

$$\begin{aligned} \lim _{m_1 \rightarrow +\infty } \frac{1}{m_1} \sum _{i = 1}^{m_1} \text {err}_{n+i} = \alpha \end{aligned}$$
(12)

the only condition is the existence of two constants \(c_1, c_2\) such that \(f(x, \theta , \hat{\mu })\) returns an empty interval for \(\theta < c_1\) and \(\mathcal {Y}\) for \(c_2 < \theta\). These requirements are fulfilled by the quantile functions in Eqs. (5) and (7) of ACI. They also show that

$$\begin{aligned} \left| \frac{1}{m_1} \sum _{i = 1}^{m_1} (\text {err}_{n+i} - \alpha ) \right| \le \frac{\text {max}\{ \theta _{n+1} - c_1, c_2 - \theta _{n+1} \} + \gamma }{m_1 \gamma } \end{aligned}$$
(13)

for a finite number of examples \(m_1\). In contrast to the other methods, Rolling CI can directly be applied to a multi-output regression setting and therefore also to a multi-step ahead time-series prediction setting. In this case, the parameter \(\theta\) inflates or deflates the predicted region in all dimensions.
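The following sketch runs the Rolling CI loop of Eqs. (10) and (11) with one simple choice of interval construction function, \(f(x, \theta , \hat{\mu }) = \hat{\mu }(x) \pm \theta\), which returns the empty region for \(\theta < 0\) and, for a bounded label space, all of \(\mathcal {Y}\) once \(\theta\) is large enough; the optional gradient update of the underlying model is omitted.

```python
import numpy as np

def rolling_ci(mu, stream, alpha=0.1, gamma=0.05, theta=1.0):
    """`stream` yields (x, y) pairs with possibly multi-dimensional y; the
    hyper-cubic region mu(x) +/- theta grows or shrinks in all dimensions."""
    for x, y in stream:
        y_hat = np.atleast_1d(mu(x))
        region = None if theta < 0 else (y_hat - theta, y_hat + theta)
        yield region                          # prediction before y is revealed
        covered = region is not None and np.all(
            np.abs(np.atleast_1d(y) - y_hat) <= theta)
        theta = theta + gamma * (float(not covered) - alpha)   # Eq. (11)
```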

Barber et al. (2022) directly address the two exchangeability conditions of the original conformal prediction method. Their non-exchangeable conformal prediction (nexCP) method does not require the examples to be exchangeable and does not require that they are treated symmetrically by the algorithm fitting the underlying model. A symmetric model fitting algorithm produces the same model independently of the order of the examples in the training set. Barber et al. (2022) apply their developments to CPs, ICPs, and the jackknife+ method. Due to its computational efficiency, we focus on the adaptation of ICPs. In the semi-off-line setting, the method splits the \(n = m_0 + m_c\) known examples into a proper training set \(\mathcal {D}_{train}\) of size \(m_0\) and a calibration set \(\mathcal {D}_{cal}\) of size \(m_c\). The underlying model \(\hat{\mu } = h_{\mathcal {D}_{train}}\) is fitted to the proper training set and the calibration set is used to compute the residuals \(r_{m_0+1:n}\). By using the weighted quantile function introduced in Eq. (7), the method adapts the generated prediction intervals to shifts in distribution without changing the value of \(\alpha\) that is passed to the quantile function. This is achieved by assigning the residuals of more recent examples higher weights than the residuals of older ones. Assuming a new object \(x_{n+1}\) with the associated label \(y_{n+1}\), Barber et al. (2022) show that the coverage gap as defined in Eq. (8) is bounded by

$$\begin{aligned} \text {coverage gap} \le \frac{\sum _{i=m_0+1}^{n} w_i \cdot \textsf{d}_{\textsf{TV}} (\mathcal {D}, \mathcal {D}^{[i]})}{1 + \sum _{i=m_0+1}^{n} w_i} \end{aligned}$$
(14)

where \(\mathcal {D} = \mathcal {D}_{cal} \cup \{ z_{n+1} \}\), \(\mathcal {D}^{[i]}\) swaps elements i and \(n+1\) in \(\mathcal {D}\), and \(\textsf{d}_{\textsf{TV}}\) is the total variation distance. More generally, this means that by reducing the weights \(w_i\) associated with the residuals of examples that contribute to an increase in the coverage gap, the validity of the predicted intervals can be improved. While a careful selection of the weights \(w_i\) leads to a reduction in the coverage gap and thereby addresses the exchangeability of the examples, it does not lift the requirement that the model fitting algorithm for the underlying model treats the examples symmetrically. This can be ignored for ICPs, as the solution proposed by Barber et al. (2022) to lift this restriction does not change the method beyond what has already been discussed. Barber et al. (2022) show further that

$$\begin{aligned} \textsf{P}(y_{n+1} \in \hat{\textbf{y}}_{n+1}) \ge 1 - \alpha - \sum _{i=m_0+1}^{n} \tilde{w}_i \cdot \textsf{d}_{\textsf{TV}} (r_{1:n}, r_{1:n}^{[i]}) \end{aligned}$$
(15)

where the weights \(\tilde{w}_i = w_i /\left( \sum _{i=m_0+1}^{n} w_i \right)\) on the nonconformity scores are normalized weights, \(r_{1:n}\) are the residuals computed on \(\mathcal {D}\), and \(r_{1:n}^{[i]}\) are the residuals computed on \(\mathcal {D}^{[i]}\). By setting \(w_i = 1, \ \forall i \in \{m_0+1, \dots , n\}\), these results also give an upper bound for the miscoverage rate of the original ICP in case the exchangeability condition is violated. Barber et al. (2022) test their method applied to a CP in three settings, comparing it to the standard CP. In the first setting, using i.i.d. examples, they show that their method performs similarly to the standard CP. In the second and third settings, the examples present change points and distribution shifts respectively, and nexCP retains approximately valid coverage while the standard CP does not.
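At the heart of nexCP lies the weighted empirical quantile of Eq. (7). The following sketch computes it for a vector of fixed weights, with the weight of the test point placed on the point mass at \(+\infty\); setting all weights to one recovers the unweighted quantile of Eq. (2).

```python
import numpy as np

def weighted_quantile(residuals, weights, alpha, w_new=1.0):
    """(1 - alpha) quantile of the discrete distribution that puts weight w_i
    on residual r_i and weight w_new on the point mass at +infinity."""
    r = np.append(np.asarray(residuals, dtype=float), np.inf)
    w = np.append(np.asarray(weights, dtype=float), w_new)
    w = w / w.sum()                              # normalized weights \tilde{w}_i
    order = np.argsort(r)
    cdf = np.cumsum(w[order])
    idx = int(np.searchsorted(cdf, 1 - alpha))   # first index with CDF >= 1 - alpha
    return r[order][idx]
```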

2.3 Conformal multi-target regression

The work presented in Subsect. 2.2 addressed the non-exchangeability of time-series data and showed how conformal prediction can be adapted to produce valid prediction intervals in the presence of non-exchangeable data. This subsection discusses the difficulties that arise when applying conformal prediction to multi-target regression and some proposed solutions.

In the case of single-target regression, the quantile function \(\textsf{Q}_{1-\alpha }\) in Eq. (2) returns the \(\lceil (1 - \alpha )(m_{c}+1) \rceil\)th smallest value of the residuals, denoted as \(r_j\) here. For an object \(x_{n+1}\), the prediction region is constructed by selecting all possible labels y such that the associated nonconformity score satisfies \(r_{n+1} \le r_j\). When using the absolute residuals as NCM, this is achieved by simply solving

$$\begin{aligned} |y - \hat{\mu }(x_{n+1})| \le r_j. \end{aligned}$$
(16)

Solving Eq. (16) for y produces exactly one region as its solution: the prediction interval \([\hat{\mu }(x_{n+1}) - r_j, \hat{\mu }(x_{n+1}) + r_j]\). This shows that there exists a bijective relationship between the upper and lower bounds of the prediction interval and the nonconformity score \(r_{n+1}\) necessary to reach the desired coverage rate.

In the multi-target regression setting, i.e. for a label space \(\mathcal {Y} \subset \mathbb {R}^{F}\), replacing the absolute residuals by the Euclidean distance as NCM is a natural choice. This allows for a procedure that is similarly computationally efficient to the one outlined for the single-target regression case. Solving Eq. (16) for this generalized NCM results in a prediction region bounded by a hypersphere centered around \(\hat{\mu }(x_{n+1})\) with a radius of \(r_j\). While simple to compute, such a prediction region does not provide the user with any information about the error the underlying model makes in the various dimensions of the label space. Instead, it assumes the underlying model is equally likely to make an error in any direction. This can lead to an inefficient prediction region, with a volume larger than necessary, if the underlying model’s errors are generally larger in some directions than in others. In addition, a prediction region in the shape of a hypersphere is difficult to visualize if the label space has more than three dimensions. While Messoudi et al. (2020) note that “[...] conformal prediction for multi-target regression is still a largely unexplored area”, some propositions have been put forward.

Vovk (2013) and later Stankevičiūtė et al. (2021) circumvent the issue of applying conformal prediction to multi-target regression by calibrating one CP for every dimension of the label space, reverting to single-target regression with a one-dimensional label space for each. Equation (19) shows how they combine the individual prediction intervals generated by these CPs for every dimension into a multi-dimensional prediction region. Stankevičiūtė et al. (2021) explicitly extend the ICP to univariate multi-horizon time-series forecasts. In this situation, the objects

$$\begin{aligned} x_i = o_{1:T_w}^{(i)} \in \mathbb {R}^{T_w} \end{aligned}$$
(17)

are the observations made over \(T_w\) time steps and the labels

$$\begin{aligned} y_i = o_{T_w+1:T_w+T_h}^{(i)} \in \mathbb {R}^{T_h} \end{aligned}$$
(18)

are the observations made during the following \(T_h\) time steps. The superscript (i) indicates that the observation \(o^{(i)}\) belongs to the ith example. Given a set of exchangeable examples \(z_{1:n}\), they follow the ICP procedure and split them into a proper training set \(\mathcal {D}_{train}\) of size \(m_0\) and a calibration set \(\mathcal {D}_{cal}\) of size \(m_c\). Then they fit an underlying model \(\hat{\mu } = h_{\mathcal {D}_{train}}: \mathbb {R}^{T_w} \rightarrow \mathbb {R}^{T_h}\) to the training set and compute the residuals \(r_i = |y_i - \hat{\mu }(x_i)| \in \mathbb {R}^{T_h}\) for the calibration set. For a new object \(x_{n+1}\), the predicted region

$$\begin{aligned} \hat{\textbf{y}}_{n+1} = \left[ \hat{\mu }(x_{n+1})_{j} \pm \textsf{Q}_{1 - \alpha / T_h}( r_{m_0+1:n,j} ), \ \forall j \in \{ 1, \dots , T_h \} \right] \in \mathbb {R}^{T_h \times 2} \end{aligned}$$
(19)

is a \(T_h\)-dimensional hyper-rectangle, where every dimension corresponds to one time step in the prediction horizon. \(T_h\) conformal predictors construct this hyper-rectangle, one for each dimension. In their quantile functions, these conformal predictors use only the residuals from the examples in the calibration set that are associated with that particular time step in the prediction horizon. Stankevičiūtė et al. (2021) apply the Bonferroni correction, setting the confidence level of the quantile function to \(1 - \alpha / T_h\) instead of \(1 - \alpha\). This guarantees that the predicted region is valid in general. This approach allows for dependence within the \(o_{1:T_w+T_h}^{(i)}\) that form an example \(z_i\), as only the \(z_i\) have to be exchangeable. Designed with recurrent neural networks (RNNs) as underlying models in mind, Stankevičiūtė et al. (2021) call their method CF-RNN. They test their method with different RNN-based underlying models on synthetic and real-world data sets and show that the produced intervals are valid.
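A compact sketch of this construction, assuming a hypothetical multi-horizon point forecaster `mu` mapping \(\mathbb {R}^{T_w}\) to \(\mathbb {R}^{T_h}\), is given below; one conformal quantile per horizon step is computed at the Bonferroni-corrected level \(\alpha / T_h\).

```python
import numpy as np

def cf_rnn_region(mu, X_cal, Y_cal, x_new, alpha):
    """Per-horizon ICP intervals at level alpha / T_h, following Eq. (19).
    X_cal: (m_c, T_w) calibration objects; Y_cal: (m_c, T_h) labels."""
    preds = np.stack([mu(x) for x in X_cal])         # (m_c, T_h) point forecasts
    residuals = np.abs(Y_cal - preds)                # one residual per horizon step
    m_c, T_h = residuals.shape
    k = int(np.ceil((1 - alpha / T_h) * (m_c + 1)))
    if k > m_c:
        raise ValueError("calibration set too small for this alpha and T_h")
    q = np.sort(residuals, axis=0)[k - 1]            # per-step conformal quantiles
    y_hat = np.asarray(mu(x_new))                    # (T_h,)
    return np.stack([y_hat - q, y_hat + q], axis=1)  # (T_h, 2) hyper-rectangle
```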

Messoudi et al. (2020) try to account for the interactions of the errors in each dimension of the label space through the use of different NCMs. Similar to Vovk (2013) and Stankevičiūtė et al. (2021), the produced prediction region is a hyper-rectangle; however, Messoudi et al. (2020) employ the Šidák correction, which assumes independence and is a little less stringent than the Bonferroni correction. They use the median volume of the produced prediction regions as a metric for the efficiency of the compared NCMs. While the authors identify general trends, the best NCM depends on the data set.

2.4 Discussion

Methods generating prediction intervals for time-series should, as is the case for conformal predictors in general, be valid and efficient. Due to the nature of time-series, such methods need to support non-exchangeable data and produce intervals for more than one feature and more than one future time step. Ideally, they would also be computationally efficient (or at least feasible in practice) and user-friendly by providing guidance on how to set the method’s parameters. Table 1 offers a comparison of the methods discussed in Sect. 2 according to these criteria.

Table 1 Comparison between the different methods discussed in Sect. 2. The coverage guarantee column displays the long-term theoretical coverage guarantee the original authors provide for their method. The multi-target column displays whether the method supports multi-target regression. The NE data column displays whether the method retains its coverage guarantee in the presence of non-exchangeable data. The symbol ✽ indicates methods that allow for short-term dependencies but lose their validity guarantee if applied to time-series with long-term dependencies

While other methods to convey uncertainty in the form of prediction intervals exist (Lin et al., 2021), they do not come with the validity guarantees that some methods derived from CPs offer and that are desired in high-stakes situations (Feldman et al., 2022). Gibbs and Candès (2021) have shown that these other methods can be used as underlying models for CP-based methods. Because of this, we have chosen to focus on CP-based methods.

As mentioned at the beginning of Subsect. 2.2, the lack of exchangeability in time-series data means that the conditions for the validity guaranteed by CP and ICP are not met. CIbP relaxes this condition by splitting the data into blocks consisting of the values of consecutive time steps. Chernozhukov et al. (2018) assume that these blocks are exchangeable. By regarding each block as a point in a multidimensional space instead of a sequence of consecutive scalar values, with each value corresponding to a time step, the values within the block no longer need to be exchangeable. While some time-series may present such a block-wise exchangeability, it is not given in general, especially if the time-series displays long-term dependencies. Another drawback of the proposed method is that, by evaluating all possible values for all dimensions of the label space, the method considers an infinite number of label candidates and is therefore not applicable in practice. CF-RNN overcomes the computational complexity of CIbP by estimating the intervals for every dimension of the label space individually and correcting for the family-wise error rate. CF-RNN relies on a block-wise exchangeable structure in the same way as CIbP, making it unsuitable for time-series with long-term dependencies or a change in distribution over time.

EnbPI takes a different approach to relaxing the exchangeability condition for the examples by requiring the residuals to be stationary and strongly mixing. This assumption is only met if the underlying model maintains the quality of its predictions when the distribution of the data changes over time. Because this is difficult to prove in general, EnbPI is listed as not retaining coverage in the presence of non-exchangeable data in Table 1. Xu and Xie (2021) apply EnbPI to the multi-step ahead setting by predicting the future time steps in the prediction horizon individually. If the method’s conditions are met, this approach guarantees validity for every individual dimension of the label space, but it does not guarantee validity in general. This is why Table 1 lists EnbPI as not supporting multi-target regression.

By growing or shrinking the generated intervals according to the coverage errors it has made in the past, ACI achieves the desired coverage frequency in the long run without making any assumptions about the data or the errors the underlying model makes. This achievement comes at the cost of a quantile function that can return infinite values, resulting in a predicted region that covers the entire label space, which is uninformative. The version of the method using the unweighted quantile function requires only a single additional parameter, the learning rate \(\gamma\), adding to its ease of use. ACI is also computationally efficient as it takes advantage of all advancements made for ICPs. The method is developed for single-target forecasts, meaning it suffers from the challenges discussed in Subsect. 2.3 when applied to multivariate multistep-ahead time-series forecasts.

Multiple extensions of ACI have been proposed. AgACI automatically estimates the learning rate \(\gamma\), thereby achieving better efficiency. This requires the choice of an on-line aggregation method and does not address any of the other shortcomings of ACI. Rolling CI is the second extension of ACI. It does not require a split of the known examples, allowing the underlying model to be fitted to all of them. Rolling CI also allows for multi-target forecasts by growing or shrinking the predicted region in all dimensions. If a shift in the distribution only occurs for a subset of features, this might lead to overly conservative intervals for the others.

nexCP could be seen as a third extension of ACI. It departs from the idea of adapting the \(\alpha\) used in the quantile function based on the observed error rate of the method, and focuses instead on the weighted quantile function. This approach comes with strong theoretical guarantees. However, just as ACI, nexCP may produce infinite prediction intervals and faces the same challenges when applied to multivariate multistep-ahead time-series forecasts. In addition, more research is required to determine the optimal weights for the weighted quantile function.

Applying conformal prediction to multivariate multistep-ahead time-series forecasts poses two challenges: first, time-series are generally non-exchangeable; second, it is not sufficient to produce a single prediction interval, as a prediction region needs to cover all dimensions of the label space. Different methods have been proposed to address these challenges individually. While Rolling CI addresses both simultaneously, we believe that it can be further improved, as it adapts the predicted region in all dimensions over time without considering which features are responsible for a change in coverage rate.

3 nmtCP method

This section aims to address some of the gaps in previous methods identified in Subsect. 2.4. Subsection 3.1 lays out our proposal on how to address these shortcomings. Subsection 3.2 takes this idea, turns it into a method and explains how to implement it in practice. This is followed by a theoretical analysis of the validity of the method in Subsection 3.3 and a discussion and comparison with related methods in Subsection 3.4.

3.1 Idea

Applying the concepts of conformal prediction to multistep-ahead multivariate time-series forecasting faces two main obstacles. Subsection 2.2 shows the challenges that the non-exchangeability of the data in time-series poses and discusses methods to overcome them. Subsection 2.3 briefly explains the additional difficulty introduced when moving from single-target to multi-target regression, which is necessary to achieve multistep-ahead multivariate forecasts.

In this subsection, we propose a new method that addresses both of these obstacles by combining two existing methods. On the one hand, Barber et al. (2022) have provided a method that relaxes the exchangeability assumption while maintaining strong validity guarantees and that is computationally efficient for single-target regression. On the other hand, Stankevičiūtė et al. (2021) show how multiple conformal predictors can be combined to produce prediction regions for multi-target regression. We merge these two approaches, replacing the CPs in the method of Stankevičiūtė et al. (2021) with nexCPs. This leads to a method that produces prediction regions for multistep-ahead multivariate time-series forecasts for any underlying model producing point forecasts for every dimension of the label space. Henceforth, we refer to this method as non-exchangeable multi-target conformal prediction or nmtCP.

3.2 Implementation

This subsection explains how we put the idea outlined in Subsect. 3.1 into practice, starting with the representation and preprocessing of the time-series data in Subsect. 3.2.1. It also expands the notation introduced in Subsect. 2.2. Next, Subsect. 3.2.2 provides a detailed look into the generation of the predictive region in the semi-off-line setting. Finally, other settings are discussed in Subsect. 3.2.3.

3.2.1 Preprocessing

When loaded, the data is generally stored in a matrix o, where each row contains the values of the observed features for a specific time step and each column contains all measurements for a specific feature. This matrix can be represented as \(o_{1:T,1:F}\), where T is the total number of time steps and F the total number of features.

$$\begin{aligned} \begin{matrix} o_{1,1} & o_{1,2} & \dots & o_{1,F} \\ o_{2,1} & o_{2,2} & \dots & o_{2,F} \\ \vdots & \vdots & \ddots & \vdots \\ o_{T,1} & o_{T,2} & \dots & o_{T,F} \end{matrix} \end{aligned}$$
(20)

We split this raw data set into examples \(z_i = (x_i, y_i)\) consisting of objects \(x_i\) and their corresponding labels \(y_i\). The objects \(x_i\) are created following a sliding window scheme. The ith object consists of

$$\begin{aligned} x_i = o_{i\cdot s + 1:i \cdot s+T_w, F_w} \end{aligned}$$
(21)

starting at \(i \cdot s + 1\), where s is the stride, describing the number of time steps between the starting points of two consecutive objects, i.e. windows. \(T_w\) is the number of time steps each object contains and \(F_w \subseteq \{1, \dots , F\}\) is the set of input features for the model. Consecutive objects overlap unless \(s \ge T_w\). Equation (23) visualizes the object \(x_i\) as a blue block. The label associated with the ith object is

$$\begin{aligned} y_i = o_{i\cdot s + T_o + 1:i\cdot s + T_o + T_h, F_h} \end{aligned}$$
(22)

where \(T_o\) is the temporal offset between the start of the object and the start of its associated label, \(T_h\) is the number of time steps in the prediction horizon and \(F_h \subseteq \{1, \dots , F\}\) is the set of features predicted by the model. The green block in Eq. (23) represents \(y_i\).

(23) [Schematic: the observation matrix \(o_{1:T,1:F}\) with the object window \(x_i\) highlighted in blue and the label window \(y_i\) highlighted in green; rendered as a figure in the original.]

In the example presented in Eq. (23), \(T_o = T_w\); therefore, the label starts at the first time step immediately after the object. In addition, \(F_w\) and \(F_h\) are disjoint. If \(T_o < T_w\), objects and associated labels overlap temporally. If \(F_w\) and \(F_h\) are not disjoint, objects and associated labels have overlapping features.

After processing the raw data set in this manner, \(\mathcal {D} = z_{1:n}\) designates the data set consisting of n examples. We note that the data set \(\mathcal {D}\) is an ordered sequence of examples.
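A minimal sketch of this windowing scheme, with `F_w` and `F_h` given as lists of column indices of the observation matrix `o`, could look as follows.

```python
import numpy as np

def make_examples(o, T_w, T_h, T_o, s, F_w, F_h):
    """Slide a window over the (T, F) observation matrix o, extracting the
    objects of Eq. (21) and the labels of Eq. (22)."""
    T = o.shape[0]
    xs, ys = [], []
    i = 0
    while i * s + max(T_w, T_o + T_h) <= T:   # both windows must fit in the series
        start = i * s
        xs.append(o[start:start + T_w, F_w])               # Eq. (21)
        ys.append(o[start + T_o:start + T_o + T_h, F_h])   # Eq. (22)
        i += 1
    return np.stack(xs), np.stack(ys)
```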

3.2.2 Semi-off-line setting

Three steps are necessary to apply the idea in the semi-off-line setting.

Step 1: Training. After obtaining the data set \(\mathcal {D}\), we partition it into a proper training set \(\mathcal {D}_{train}\) and a calibration set \(\mathcal {D}_{cal}\). \(m_0 = |\mathcal {D}_{train}|\) and \(m_c = |\mathcal {D}_{cal}|\) designate the sizes of the proper training set and the calibration set respectively, with \(m_0 + m_c = n\). The partition of the data set is drawn along the time axis and the elements within \(\mathcal {D}_{train}\) and \(\mathcal {D}_{cal}\) are not shuffled. Next, we fit the underlying model \(\hat{\mu }\) on \(\mathcal {D}_{train}\).

Algorithm 2 (figure): nmtCP in the semi-off-line setting. The underlying model is only trained once but the calibration set grows with every new label that is provided to the method.

Step 2: Initial calibration. We use the fitted model \(\hat{\mu }\) to generate predictions \(\hat{y}_i = \hat{\mu }(x_i)\) for all objects \(x_i \in \mathcal {D}_{cal}\). The chosen NCM A as defined in Subsect. 2.1.1 computes a nonconformity score \(r_{i,t,f} = A (y_{i,t,f}, \hat{y}_{i,t,f})\) for every dimension of \(\mathcal {Y}\); we use the absolute residuals as NCM. These residuals are then used to calibrate \(T_h \times F_h\) nexCPs (Barber et al., 2022), denoted by

$$\begin{aligned} \Gamma ^{\alpha '}_{t,f}, \quad \forall t = 1, \dots , T_h, \ \forall f = 1, \dots , F_h. \end{aligned}$$
(24)

For any combination of fixed t and f, the associated nexCP \(\Gamma ^{\alpha '}_{t,f}\) initially uses the nonconformity scores \(r_{m_0+1:n,t,f}\). This means that the nexCP associated with one dimension of the label space is calibrated exclusively on the nonconformity scores computed for that particular dimension of the label space. Instead of using \(\alpha\) as the targeted miscoverage rate, we apply the Bonferroni correction, resulting in

$$\begin{aligned} \alpha ' = \frac{\alpha }{T_h \cdot F_h} \end{aligned}$$
(25)

to account for the family-wise error rate.

Step 3: Inference. Presented with a new object \(x_i, i > n\), the underlying model generates a prediction for the associated label, \(\hat{y}_i = \hat{\mu }(x_i)\). Then the \(T_h \times F_h\) nexCPs \(\Gamma ^{\alpha '}_{1:T_h,1:F_h}\) compute an interval for every dimension of the label space. They do this by applying the weighted quantile function to the absolute residuals \(r_{m_0+1:i-1,t,f}\) of the associated dimension (t, f). We combine the predicted intervals to form the prediction region

$$\begin{aligned} \begin{aligned} \hat{\textbf{y}}_i = [\ &\hat{y}_{i,t,f} \pm \textsf{Q}_{1-\frac{\alpha }{T_h \cdot F_h}}(r_{m_0+1:i-1,t,f}, \tilde{w}_{m_0+1:i-1}), \\&\forall t \in \{ 1, \dots , T_h \}, \ \forall f \in \{ 1, \dots , F_h \} \ ] \end{aligned} \end{aligned}$$
(26)

with \(\hat{\textbf{y}}_i \in \mathbb {R}^{T_h \times F_h \times 2}\) and where \(\tilde{w}_{m_0+1:i-1}\) is a vector of weights on the nonconformity scores. In the semi-off-line setting, nature then provides the true label \(y_i\), which we use to compute the residuals \(r_{i,1:T_h,1:F_h}\). These new residuals are added to the set of known residuals for their respective dimensions and will be used when the algorithm generates the prediction region for the next example \(z_{i+1}\).

Algorithm 2 describes these three steps in a way that can be implemented directly. In the presence of a pretrained model \(\hat{\mu }\), the first step can be omitted entirely and the entire initial data set \(\mathcal {D}\) can be used as the calibration set. This holds as long as none of the examples from the data set \(\mathcal {D}\) were used during the training of the model \(\hat{\mu }\).
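A condensed sketch of Steps 2 and 3, reusing the `weighted_quantile` helper sketched in Subsect. 2.2.3 and assuming an underlying model `mu` that maps an object to a \((T_h, F_h)\) point forecast, is given below; placing the most recent weight on the point mass at \(+\infty\) is one possible choice among others.

```python
import numpy as np

def nmtcp_region(mu, residuals, weights, x_new, alpha):
    """residuals: (m, T_h, F_h) absolute residuals on the calibration set;
    weights: (m,) weights on the nonconformity scores.
    Returns the lower and upper bounds, each of shape (T_h, F_h)."""
    m, T_h, F_h = residuals.shape
    alpha_prime = alpha / (T_h * F_h)        # Bonferroni correction, Eq. (25)
    y_hat = np.asarray(mu(x_new))            # (T_h, F_h) point forecast
    lo, hi = np.empty((T_h, F_h)), np.empty((T_h, F_h))
    for t in range(T_h):
        for f in range(F_h):                 # one nexCP per dimension (t, f)
            q = weighted_quantile(residuals[:, t, f], weights, alpha_prime)
            lo[t, f] = y_hat[t, f] - q
            hi[t, f] = y_hat[t, f] + q
    return lo, hi
```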

3.2.3 On-line, semi-on-line and off-line setting

The method also works in the other settings laid out in Subsect. 2.1.2. Only minor changes to the steps shown in Subsect. 3.2.2 are necessary.

In the on-line setting, the partition between proper training set \(\mathcal {D}_{train}\) and the calibration set \(\mathcal {D}_{cal}\) is redrawn every time a new label becomes available. The underlying model is then fitted to the new training set \(\mathcal {D}_{train}\) and the nexCPs \(\Gamma ^{\alpha '}_{1:T_h,1:F_h}\) are calibrated on \(\mathcal {D}_{cal}\). The prediction step omits the computation of the residuals when nature provides a new label as this action is already covered by the calibration step. If the underlying model is computationally expensive to fit, this mode of operation might become prohibitively costly.

The semi-on-line setting is a hybrid of the on-line and semi-off-line setting. As new examples become available, they are periodically split into a proper training set \(\mathcal {D}_{train}\) and a calibration set \(\mathcal {D}_{cal}\) along the time axis. When this happens, the underlying model is fitted to the new training set \(\mathcal {D}_{train}\) and the nexCPs \(\Gamma ^{\alpha '}_{1:T_h,1:F_h}\) are calibrated on \(\mathcal {D}_{cal}\). Then the method operates in the semi-off-line setting until the examples are partitioned again and the underlying model is retrained.

Although the method can operate in the off-line setting, it is not useful, as the information provided by new labels remains unused. Over time, it would not adapt to distribution shifts in the data.

3.3 Validity

Theorem 1

Let \(\hat{\mu }\) be any pre-fitted model. Then the nmtCP method defined in Eq. (26) satisfies

$$\begin{aligned} \textsf{P} \left( y_{n+1} \notin \hat{\textbf{y}}_{n+1} \right) \le \alpha + \sum _{t=1}^{T_h} \sum _{f=1}^{F_h} G_{m_0 + 1:n,t,f} \end{aligned}$$
(27)

where \(G_{m_0 + 1:n,t,f} = \sum _{i = m_0 + 1}^{n} \tilde{w}_i \cdot \textsf{d}_{\textsf{TV}} (r_{m_0 + 1:n,t,f}, r_{m_0 + 1:n,t,f}^{[i]})\) is the coverage gap and \(\tilde{w}_i\) are the weights on the nonconformity scores.

Proof

We view each nexCP as having been calibrated on a dedicated calibration set for its associated dimension. For nexCPs, Barber et al. (2022) state in their Theorem 2a that, in the inductive conformal prediction setting,

$$\begin{aligned} \textsf{P} \left( y_{n+1,t,f} \in \hat{\textbf{y}}_{n+1,t,f} \right) \ge 1 - \frac{\alpha }{T_h \cdot F_h} - G_{m_0 + 1:n,t,f} \end{aligned}$$
(28)

for a new object \(x_{n+1}\), where \(r^{[i]}\) swaps elements i and \(n+1\) in r. This is equivalent to

$$\begin{aligned} \textsf{P} \left( y_{n+1,t,f} \notin \hat{\textbf{y}}_{n+1,t,f} \right) \le \frac{\alpha }{T_h \cdot F_h} + G_{m_0 + 1:n,t,f} \end{aligned}$$
(29)

providing an upper bound for the miscoverage rate for every dimension in the label space. Using Boole’s inequality,

$$\begin{aligned} \textsf{P} \left( y_{n+1} \notin \hat{\textbf{y}}_{n+1} \right)&= \textsf{P} \left( \cup _{t=1}^{T_h} \cup _{f=1}^{F_h} (y_{n+1,t,f} \notin \hat{\textbf{y}}_{n+1,t,f}) \right) \end{aligned}$$
(30)
$$\begin{aligned}&\le \sum _{t=1}^{T_h} \sum _{f=1}^{F_h} \textsf{P} \left( y_{n+1,t,f} \notin \hat{\textbf{y}}_{n+1,t,f} \right) \end{aligned}$$
(31)
$$\begin{aligned}&\le \sum _{t=1}^{T_h} \sum _{f=1}^{F_h} \left( \frac{\alpha }{T_h \cdot F_h} + G_{m_0 + 1:n,t,f} \right) \end{aligned}$$
(32)
$$\begin{aligned}&= \alpha + \sum _{t=1}^{T_h} \sum _{f=1}^{F_h} G_{m_0 + 1:n,t,f} \end{aligned}$$
(33)

bounds the miscoverage rate for the predicted region constructed from the individual nexCPs’ intervals. \(\square\)

Equation (27) shows that the upper bound on the miscoverage rate of the proposed method depends on the targeted miscoverage rate \(\alpha\) and on the sum of the coverage gaps \(G_{m_0 + 1:n,t,f}\) of the individual nexCPs. These coverage gaps depend on the chosen weights \(\tilde{w}_i\). This means that if the weights are chosen carefully, the coverage gaps \(G_{m_0 + 1:n,t,f}\) decrease and the method becomes approximately valid (Oliveira et al., 2022) for slow shifts in distribution. Optimizing the weights \(\tilde{w}_i\) directly is not possible, as the computation of the total variation distance \(\textsf{d}_{\textsf{TV}}\) requires knowledge of the true label \(y_{n + 1}\). However, assigning higher weights to past examples similar to the one for which the method is predicting the label, and lower weights to past examples that are less similar, decreases the coverage gaps \(G_{m_0 + 1:n,t,f}\) and gives the weights their intuitive interpretation (Barber et al., 2022).

The theoretical upper bound on the method’s miscoverage rate presented in Eq. (27) does not require the same weights \(\tilde{w}_i\) for every dimension. While this gives the user a lot of control over the method, we did not explore this possibility in the context of this article. Following the advice of Barber et al. (2022), we opted for a few simple weight generating functions for the experiments presented in Sect. 4.
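For concreteness, the weighted empirical quantile that each nexCP evaluates could be computed as in the following sketch; the function name and array conventions are our own, and the weight on the point mass at \(+\infty\) (standing in for the not-yet-observed score) is supplied as the last entry of the weight vector.

import numpy as np

def weighted_quantile(scores, weights, level):
    """Level-`level` quantile of sum_i w_i * delta_{scores[i]} plus a point
    mass at +inf carrying the last entry of `weights`; `weights` therefore
    has one more entry than `scores`."""
    s = np.append(np.asarray(scores, dtype=float), np.inf)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalized weights (tilde-w)
    order = np.argsort(s)
    cdf = np.cumsum(w[order])
    idx = min(np.searchsorted(cdf, level), len(s) - 1)
    return s[order][idx]

For a per-dimension miscoverage level \(\alpha '\), the half-width of the corresponding interval is then weighted_quantile(scores, weights, 1 - alpha_prime).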

3.4 Comparison

This subsection compares nmtCP to the original CP (Vovk et al., 2005) and highlights some key differences with the methods discussed in Sect. 2.

In contrast to the original ICP, nmtCP does not require the underlying data to be exchangeable and remains computationally efficient even when applied in a multi-target regression setting. This comes at the cost of a weakened validity guarantee.

CIbP (Chernozhukov et al., 2018) and CF-RNN (Stankevičiūtė et al., 2021) rely on the presence of exchangeable blocks in the time-series for their respective validity guarantees. nexCP (Barber et al., 2022), and therefore nmtCP, makes no such assumption and can thus be applied to time-series that exhibit long-term dependencies or sudden distribution shifts.

EnbPI (Xu & Xie, 2021) is computationally more expensive than nmtCP because it trains an ensemble of underlying models. The benefit of this approach is that it avoids splitting the known examples into a proper training set and a calibration set. In the semi-off-line setting, EnbPI behaves like nexCP, and therefore like nmtCP with a weight function that assigns the same weight to a fixed number of past residuals. The weighted quantile function used in nexCP and nmtCP can be interpreted as a generalization of the sliding window approach that Xu and Xie (2021) chose for the integration of new residuals in EnbPI; a sketch of this correspondence follows below. Because Xu and Xie (2021) built their validity guarantees on assumptions about the residuals, it is difficult to compare them to the one presented in Eq. (27). However, nmtCP offers more flexibility to determine how the method reacts to distribution shifts through the weights \(w_i\) in the weighted quantile function, even if the performance of the underlying model varies over time.
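To make this correspondence concrete, a sliding window over the K most recent residuals corresponds to the following (hypothetical) weight function for nmtCP's weighted quantile:

import numpy as np

def sliding_window_weights(n, K=200):
    """Equal weight on the K most recent of n + 1 residuals, zero elsewhere,
    mimicking EnbPI's sliding window inside the weighted quantile function."""
    w = np.zeros(n + 1)
    w[-min(K, n + 1):] = 1.0
    return w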

ACI (Gibbs & Candès, 2021), being its direct precursor, is quite similar to nexCP. It is capable of adapting to distribution shifts, is computationally efficient, comes with strong validity guarantees, and offers the flexibility of the weighted quantile function. Like nexCP, and in contrast to nmtCP, it lacks support for multi-target regression.

In comparison to nmtCP, Rolling CI (Feldman et al., 2022) does not need to split the known examples into a proper training set and a calibration set, allowing the underlying model to be fitted to the most recent data points. Rolling CI also includes a training step for the underlying model every time a new label becomes available, making it well suited to an on-line setting. It behaves similarly to ACI, scaling the size of the prediction region over time. However, for multi-target regression the prediction region is scaled equally in all dimensions. This might not be desirable, as not all dimensions necessarily contribute equally to a change in coverage rate. nmtCP does not suffer from this behavior since the intervals for the individual dimensions in the label space are constructed independently and then combined into the prediction region.

Overall, nmtCP offers a flexible solution that adapts to distribution shifts in the data. It can be used on top of any multi-target regression model with little to no modifications to the underlying model, and its miscoverage rate is theoretically bounded. This set of characteristics, tailored to multistep-ahead multivariate time-series forecasting, is not found in the other methods examined in this article. It comes at the expense of weaker validity guarantees and of splitting the data, which reduces the effective training set size for the underlying model.

4 Experiments

In this section, we examine the empirical performance of the nmtCP method presented in Sect. 3 on simulated and real-world time-series forecasting tasks and compare it to CF-RNN (Stankevičiūtė et al., 2021). Subsection 4.1 introduces the data sets and explains the preprocessing steps taken to prepare them for nmtCP. Subsection 4.2 describes how the experiments are conducted and which variants of the method we used. This is followed by the metrics used to evaluate the results in Subsect. 4.3. The results of the experiments are reported in Subsect. 4.4 and discussed in Subsect. 4.5.

4.1 Data sets

We generate simulated data distributions to analyze our method’s behavior. Inspired by Rolnick et al. (2019), we also chose two data sets from the electricity domain, ELEC2 (4.1.2) and Tétouan City (4.1.3), to test our method.

4.1.1 Simulations

Fig. 1

Features 5 and 6 of simulation setting 1, exhibiting a changepoint at time-step 3601

Fig. 2

Features 5 and 6 of simulation setting 2. Feature 6 exhibits a changepoint at time-step 3601

Fig. 3

Features 5 and 6 of simulation setting 3, exhibiting a distribution drift from time-step 3201 until 4000

Fig. 4

Features 5 and 6 of simulation setting 4. Feature 6 exhibits a distribution drift from time-step 3201 until 4000

All simulation settings consist of six features and 4800 time-steps. They are generated as follows.

  • Feature 1: \(o_{t,1} = \sin (\frac{\pi t}{12}) \quad \forall t \in \{1, \dots , 4800\}\)

  • Feature 2: \(o_{t,2} = \sin (\frac{\pi t}{84}) \quad \forall t \in \{1, \dots , 4800\}\)

  • Feature 3: \(o_{t,3} = -\text {sign}(\sin (\frac{\pi t}{12})) \quad \forall t \in \{1, \dots , 4800\}\)

  • Feature 4: \(o_{t,4} = -\text {sign}(92 - \text {mod}(t, 168)) \quad \forall t \in \{1, \dots , 4800\}\)

  • Feature 5: \(o_{t,5} = [o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}]\beta _{t,5} + \epsilon _{t, 5} \quad \forall t \in \{1, \dots , 4800\}\)

  • Feature 6: \(o_{t,6} = [o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}]\beta _{t,6} + \epsilon _{t, 6} \quad \forall t \in \{1, \dots , 4800\}\)

The settings differ in their values for \(\beta _{t,5}\) and \(\beta _{t,6}\) outlined in eqs. (34) to (41) below.
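As an illustration, the base features and a generic label feature can be generated as follows. Treating the second argument of \(\mathcal {N}\) as a variance and fixing a seed are our own assumptions; the noise term itself is described in the next paragraph.

import numpy as np

t = np.arange(1, 4801)
rng = np.random.default_rng(0)

# Base features (the objects), as listed above
o1 = np.sin(np.pi * t / 12)
o2 = np.sin(np.pi * t / 84)
o3 = -np.sign(np.sin(np.pi * t / 12))
o4 = -np.sign(92 - np.mod(t, 168))
X = np.stack([o1, o2, o3, o4], axis=1)               # shape (4800, 4)

# Noise with variance (1 + mod(t, 3)) / 10, growing within each window
eps_std = np.sqrt((1 + np.mod(t, 3)) / 10)

def label_feature(beta):
    """beta: array of shape (4800, 4) holding the setting-specific
    coefficients from Eqs. (34) to (41)."""
    return np.einsum("tf,tf->t", X, beta) + rng.normal(0.0, eps_std)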

Before feeding the data set to the underlying model, nmtCP, and CF-RNN, we split it into consecutive windows as shown in Subsect. 3.2.1 with \(T_w = T_o = 3\), \(T_h=3\), and a stride of \(s=3\) between consecutive windows, resulting in 1600 examples. The objects are made up of the first four features and the corresponding labels are made up of features 5 and 6. Features 5 and 6 each contain a random noise term \(\epsilon _{t, 5} \sim \mathcal {N}(0, \frac{1 + \text {mod}(t, 3)}{10})\) and \(\epsilon _{t, 6} \sim \mathcal {N}(0, \frac{1 + \text {mod}(t, 3)}{10})\) whose variance increases over the span of the prediction horizon for each object. We split the examples along the time axis into a proper training set, a calibration set, and a test set containing 1200, 1200, and 2400 time-steps, i.e. 400, 400, and 800 examples, respectively. Dashed lines represent these splits in Figs. 1, 2, 3 and 4.
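A sketch of this windowing step, under our own boundary handling (windows that do not fit completely are dropped, so the example count may differ by one from the figures quoted above):

import numpy as np

def make_windows(series, T_o, T_h, stride, object_cols, label_cols):
    """Split a (T, F) array into (objects, labels) along the time axis."""
    objects, labels = [], []
    start = 0
    while start + T_o + T_h <= len(series):
        objects.append(series[start:start + T_o, object_cols])
        labels.append(series[start + T_o:start + T_o + T_h, label_cols])
        start += stride
    return np.asarray(objects), np.asarray(labels)

# Simulation settings: T_w = T_o = 3, T_h = 3, stride s = 3, e.g.
# X_win, Y_win = make_windows(data, 3, 3, 3, slice(0, 4), slice(4, 6))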

We consider four simulation settings described in detail below.

  • Simulation setting 1: changepoint on all label features introduces a changepoint at \(t = 3600\), i.e. after 1200 examples, affecting both of the label’s features. It uses the following values for \(\beta\).

    $$\begin{aligned} \beta _{t,5} = {\left\{ \begin{array}{ll} {[}0.7, 0.3, 0, 0] ^T & \text {if } t \in \{ 1, \dots , 3600 \} \\ {[}0, 0.7, 0.3, 0]^T & \text {if } t \in \{ 3601, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (34)
    $$\begin{aligned} \beta _{t,6} = {\left\{ \begin{array}{ll} {[}0, 0, 0.4, 0.6] ^T & \text {if } t \in \{ 1, \dots , 3600 \} \\ {[}0.6, 0, 0, 0.4]^T & \text {if } t \in \{ 3601, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (35)

    Figure 1 displays features 5 and 6 for setting 1. The changepoint is highlighted by a dotted line. The entire data set for setting 1 is represented in Fig. 38 in the appendix.

  • Simulation setting 2: changepoint on one label feature introduces a changepoint at \(t = 3600\), i.e. after 1200 examples, affecting only feature 6. It uses the following values for \(\beta\).

    $$\begin{aligned} \beta _{t,5} = [0.7, 0.3, 0, 0]^T \quad \forall t \in \{ 1, \dots , 4800 \} \end{aligned}$$
    (36)
    $$\begin{aligned} \beta _{t,6} = {\left\{ \begin{array}{ll} {[}0, 0, 0.4, 0.6] ^T & \text {if } t \in \{ 1, \dots , 3600 \} \\ {[}0.7, 0.3, 0.4, 0.6]^T & \text {if } t \in \{ 3601, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (37)

    Figure 2 displays features 5 and 6 for setting 2. The changepoint is highlighted by a dotted line.

  • Simulation setting 3: distribution drift on all label features introduces distribution drift starting at \(t = 3200\), i.e. at example 1066, and running until \(t = 4000\), i.e. until example 1333, affecting both of the label’s features. It uses the following values for \(\beta\).

    $$\begin{aligned} \beta _{t,5} = {\left\{ \begin{array}{ll} {[}0.7, 0.3, 0, 0]^T & \text {if } t \in \{ 1, \dots , 3200 \} \\ {[}0.7, 0.3, 0, 0]^T + \frac{t - 3200}{800}[-0.7, 0.4, 0.3, 0]^T & \text {if } t \in \{ 3201, \dots , 4000 \} \\ {[}0, 0.7, 0.3, 0]^T & \text {if } t \in \{ 4001, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (38)
    $$\begin{aligned} \beta _{t,6} = {\left\{ \begin{array}{ll} {[}0, 0, 0.4, 0.6]^T & \text {if } t \in \{ 1, \dots , 3200 \} \\ {[}0, 0, 0.4, 0.6]^T + \frac{t - 3200}{800}[0.6, 0, -0.4, -0.2]^T & \text {if } t \in \{ 3201, \dots , 4000 \} \\ {[}0.6, 0, 0, 0.4]^T & \text {if } t \in \{ 4001, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (39)

    Figure 3 displays features 5 and 6 for setting 3. The start-point and end-point of the distribution drift are highlighted by dotted lines.

  • Simulation setting 4: distribution drift on one label feature introduces distribution drift starting at \(t = 3200\), i.e. at example 1066, and running until \(t = 4000\), i.e. until example 1333, affecting only feature 6. It uses the following values for \(\beta\).

    $$\begin{aligned} \beta _{t,5} = [0.7, 0.3, 0, 0]^T \quad \forall t \in \{ 1, \dots , 4800 \} \end{aligned}$$
    (40)
    $$\begin{aligned} \beta _{t,6} = {\left\{ \begin{array}{ll} {[}0, 0, 0.4, 0.6]^T & \text {if } t \in \{ 1, \dots , 3200 \} \\ {[}0, 0, 0.4, 0.6]^T + \frac{t - 3200}{800}[0.7, 0.3, 0, 0]^T & \text {if } t \in \{ 3201, \dots , 4000 \} \\ {[}0.7, 0.3, 0.4, 0.6]^T & \text {if } t \in \{ 4001, \dots , 4800 \} \end{array}\right. } \end{aligned}$$
    (41)

    Figure 4 displays features 5 and 6 for setting 4. The start-point and end-point of the distribution drift are highlighted by dotted lines.

4.1.2 ELEC2

The ELEC2 data set (Harries, 1999) contains information about the electricity demand and price in the Australian states of New South Wales and Victoria. Measurements were taken every 30 min between May 1996 and December 1998. For the experiments, we chose to use the first 20,000 time steps in the data set and the features representing the electricity demand in New South Wales, the electricity demand in Victoria, and the amount of electricity transferred between the two states. Figure 5 shows this selection and the split between the proper training set, the calibration set, and the test set, marked by vertical dotted lines. The proper training set contains the first 8000 time steps, the calibration set is made up of the following 8000 time steps, and the test set comprises the last 4000 time steps.

Fig. 5

Features and time steps from the ELEC2 data set (Harries, 1999) used in the experiments

Before feeding the data set to the underlying model, nmtCP, and CF-RNN, we split it into consecutive windows as shown in Subsect. 3.2.1 with \(T_w=T_o=192\), \(T_h=12\), and a stride of \(s=12\) between consecutive windows, resulting in 1650 examples and a prediction horizon of six hours. Both the object and the corresponding label are made up of all three features. We split the examples along the time axis into a proper training set, a calibration set, and a test set containing 660, 660, and 330 examples, respectively. Note that the features VIC and Transfer have a constant value throughout the proper training set and the calibration set and display a strong distribution shift in the test set, visible in Fig. 5.

4.1.3 Tétouan City

Fig. 6

Features and time steps from the Tétouan City data set (Salam & Hibaoui, 2018) used in the experiments

The Tétouan City data set (Salam & Hibaoui, 2018) contains weather information and the electricity consumption of three distribution networks of the northern Moroccan city of Tétouan. Measurements were taken every 10 min throughout 2017. For the experiments, we chose to use the first 40,000 time steps in the data set and the features representing the electricity consumption of each of the three distribution networks. Figure 6 shows this selection and the split between the proper training set, the calibration set, and the test set, marked by vertical dotted lines. The proper training set contains the first 20,000 time steps, the calibration set is made up of the following 5000 time steps, and the test set comprises the last 15,000 time steps. The three distribution networks are referred to as Zone 1, Zone 2, and Zone 3 in Fig. 6.

Similar to the ELEC2 data set (Harries, 1999), we split the Tétouan City data set into consecutive windows with \(T_w=144\), \(T_h=6\), and a stride of \(s=6\) between consecutive windows, resulting in 6642 examples and a prediction horizon of one hour. Again, both the object and the corresponding label are made up of all three features. The proper training set, the calibration set, and the test set contain 3321, 830, and 2491 examples, respectively, split along the time axis. Compared to the ELEC2 data set, the features of the Tétouan City data set exhibit a more gradual distribution shift in the test set, shown in Fig. 6.

4.2 Experimental setup

The experiments follow the process outlined in Subsect. 3.2.2.

Step 1: Training. We chose linear regression for the underlying model, using PyTorch for the implementation and the Adam optimizer during training. While it produces results similar to the recurrent neural networks used in Stankevičiūtė et al. (2021) and Schlembach et al. (2022) on this task, the linear regression model is much faster to train, which allowed for multiple repetitions of every experiment. For both data sets, we train the underlying models on the proper training set for 1000 epochs, using a batch size of 100 and a learning rate of 0.01.
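A sketch of this training step under the stated hyper-parameters; the mean-squared-error objective and the tensor names X_train and Y_train are our own assumptions, as the article does not specify them (shapes shown for the ELEC2 windowing).

import torch

# Assumed shapes: X_train (660, 192, 3), Y_train (660, 12, 3)
model = torch.nn.Linear(192 * 3, 12 * 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X_train.flatten(1), Y_train.flatten(1)),
    batch_size=100, shuffle=True)

for epoch in range(1000):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)  # assumed objective
        loss.backward()
        optimizer.step()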

Step 2: Initial calibration. After the underlying model is fitted to the proper training set, we use the model to generate predictions on the calibration set. The nexCPs (Barber et al., 2022) then use these predictions to calibrate for every dimension in the label space. To increase efficiency, we calibrate multiple groups of nexCPs with different weights for the quantile functions simultaneously. Subsection 4.2.1 provides a detailed description of the functions generating these weights. In addition to increasing efficiency, sharing the same underlying model between multiple instances of nmtCP with different weights for the quantile function has the added benefit that, when comparing the validity and efficiency of the produced prediction regions, they are based on the same residuals.

Step 3: Inference. Finally, we present the underlying model with the objects from the test set sequentially, in their original order. The prediction is given to each group of calibrated nexCPs, each of which generates a prediction region in the form of a hyper-rectangle. Then we compute the corresponding residual for every dimension in the label space using the true label and add the resulting values to the calibration set of each nmtCP.
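Steps 2 and 3 for a single weight function might look as follows, assuming absolute residuals as nonconformity scores, the weighted_quantile sketch from Subsect. 3.3, and the Bonferroni-corrected per-dimension level \(\alpha ' = \alpha / (T_h \cdot F_h)\); predict and weight_fn stand in for the fitted model and the chosen weight function.

import numpy as np

alpha, T_h, F_h = 0.2, 12, 3
alpha_prime = alpha / (T_h * F_h)                    # Bonferroni correction

# Step 2: nonconformity scores on the calibration set, shape (n_cal, T_h, F_h)
residuals = np.abs(Y_cal - predict(X_cal))

# Step 3: sequential inference on the test set
regions = []
for x, y in zip(X_test, Y_test):
    y_hat = predict(x[None])[0]                      # shape (T_h, F_h)
    w = weight_fn(residuals.shape[0] - 1)            # one weight per residual
    w = np.append(w, w[-1])                          # weight for the +inf mass
                                                     # (assumed: newest weight)
    lo, hi = np.empty_like(y_hat), np.empty_like(y_hat)
    for ti in range(T_h):
        for fi in range(F_h):
            q = weighted_quantile(residuals[:, ti, fi], w, 1 - alpha_prime)
            lo[ti, fi], hi[ti, fi] = y_hat[ti, fi] - q, y_hat[ti, fi] + q
    regions.append((lo, hi))                         # hyper-rectangle
    residuals = np.concatenate([residuals, np.abs(y - y_hat)[None]])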

After computing all examples in the test set, we evaluate them using the metrics described in Subsect. 4.3. All experiments are repeated 20 times and the results are averaged to account for the random initialization of the underlying model.

4.2.1 Weight functions

Fig. 7

Normalized weights produced by different weight functions for the past 300 examples

In the semi-off-line setting, the number of residuals used by the weighted quantile function in every nexCP (Barber et al., 2022) grows with every new true label \(y_{i-1}\) that becomes available. Therefore, instead of assigning static weights \(\tilde{w}_i\), we used functions to generate the appropriate number of weights every time a new object \(x_{i}\) is processed. In the experiments, we compared the following weight functions.

Exponential The weight of past residuals decreases exponentially the further they are in the past.

$$\begin{aligned} W(i - m_0) = [e^{\beta (j - i + m_0)}]_{j=0}^{i - m_0} \end{aligned}$$
(42)

where \(\beta\) is a parameter controlling how quickly the value of past weights decreases. In the experiments, \(\beta = 0.007\).

Soft cutoff This weight function mimics a sliding window approach with a soft transition from the past residuals that are included to the ones that are excluded.

$$\begin{aligned} W(i - m_0) = \left[ \frac{j - i + m_0 + \beta _c}{\beta _s + |j - i + m_0 + \beta _c|} + 1\right] _{j=0}^{i - m_0} \end{aligned}$$
(43)

where \(\beta _c\) is the cutoff point, indicating how many past residuals are considered, and \(\beta _s\) controls the “softness” of the transition. In the experiments, \(\beta _c = 200\) and \(\beta _s = 50\).

Linear The weight of past residuals decreases linearly the further they are in the past.

$$\begin{aligned} W(i - m_0) = \left[ \frac{j}{i - m_0}\right] _{j=0}^{i - m_0} \end{aligned}$$
(44)

Constant All past residuals are given the same constant weight.

Figure 7 illustrates the weights these functions generate for 300 past residuals.
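In code, these weight functions can be written directly from Eqs. (42) to (44), with \(n = i - m_0\) so that each call returns \(n + 1\) weights (a sketch; the function names are our own):

import numpy as np

def exponential(n, beta=0.007):
    """Eq. (42): weight e^{beta * (j - n)} for j = 0..n; newest weight is 1."""
    j = np.arange(n + 1)
    return np.exp(beta * (j - n))

def soft_cutoff(n, beta_c=200, beta_s=50):
    """Eq. (43): smooth step keeping roughly the beta_c most recent residuals."""
    j = np.arange(n + 1)
    k = j - n + beta_c
    return k / (beta_s + np.abs(k)) + 1

def linear(n):
    """Eq. (44): weights rising linearly from 0 (oldest) to 1 (newest)."""
    j = np.arange(n + 1)
    return j / max(n, 1)

def constant(n):
    """All residuals receive the same weight."""
    return np.ones(n + 1)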

4.3 Evaluation

To evaluate the method’s validity in the experiments, we used two metrics: the mean coverage rate and the rolling coverage rate. Following the definition of validity given in Subsect. 2.1.1, the coverage rate computes the fraction of prediction regions \(\hat{\textbf{y}}_{i}\) that contain the true label \(y_i\) for a given confidence level \(1 - \alpha\). If the coverage rate is equal to or greater than \(1 - \alpha\), the method is valid. The rolling coverage rate applies the same principle using a sliding window approach, hence providing a local coverage rate.

The mean interval width provides information about the efficiency of the method for a given confidence level \(1 - \alpha\). In the multistep-ahead multivariate time-series forecasting setting, it computes the average difference between the upper and the lower bound of the prediction interval over every dimension in the label space and every example in the test set. A lower value therefore indicates greater information efficiency.
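The following sketch shows how these metrics can be computed; the array names and the (n_examples, T_h, F_h) layout are our own conventions.

import numpy as np

def coverage_rate(y, lower, upper):
    """Fraction of examples whose label lies inside the hyper-rectangle.
    All arrays have shape (n_examples, T_h, F_h)."""
    inside = np.all((lower <= y) & (y <= upper), axis=(1, 2))
    return inside.mean()

def rolling_coverage_rate(y, lower, upper, window=50):
    """Coverage rate over a sliding window of past examples."""
    inside = np.all((lower <= y) & (y <= upper), axis=(1, 2)).astype(float)
    return np.convolve(inside, np.ones(window) / window, mode="valid")

def mean_interval_width(lower, upper):
    """Average width over all dimensions and examples; lower is better."""
    return np.mean(upper - lower)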

4.4 Results

The experimental results are grouped by data set.

4.4.1 Simulations

Fig. 8

Mean coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Average taken over 20 trials

Fig. 9

Mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Average taken over 20 trials

Fig. 10

Rolling coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 11

Rolling mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 12

Rolling coverage rate by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 13

Rolling mean interval width by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 1. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 14

Mean coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Average taken over 20 trials

Fig. 15

Mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Average taken over 20 trials

Fig. 16

Rolling coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 17

Rolling mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 18

Rolling coverage rate by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 19

Rolling mean interval width by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 2. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 20

Mean coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Average taken over 20 trials

Fig. 21

Mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Average taken over 20 trials

Fig. 22

Rolling coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 23

Rolling mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 24

Rolling coverage rate by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 25

Rolling mean interval width by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 3. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Simulation setting 1: changepoint on all label features shows how the method behaves when presented with a sudden change in the data’s distribution. In this setting, nmtCP remains globally valid when using the exponential or the soft cutoff weight function. nmtCP remains almost valid in combination with the linear weight function, but loses validity for most confidence levels when all residuals are given the same weight. Figure 8 shows that in all instances, nmtCP outperforms CF-RNN by realizing a greater coverage rate for all values of the confidence level \(1-\alpha\); CF-RNN does not utilize new residuals when they become available. Figure 9 shows that nmtCP achieves this boost in validity by producing wider prediction regions than CF-RNN.

Both the global coverage rate and the mean interval width are large for small confidence levels \(1 - \alpha\), resulting in all methods being valid for \(1 - \alpha \le 0.2\). Figure 9 shows that the mean interval width increases slowly with the confidence level \(1 - \alpha\).

A coverage rate \(\ge 0.95\) is only achieved through infinite prediction intervals. This behavior cannot be represented in Fig. 9.

Fig. 26

Mean coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Average taken over 20 trials

Fig. 27

Mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Average taken over 20 trials

Fig. 28

Rolling coverage rate for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 29

Rolling mean interval width for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 30

Rolling coverage rate by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Fig. 31

Rolling mean interval width by feature for nmtCP with different weight functions and CF-RNN on the test set of simulation 4. Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

While Figs. 8 and 9 provide a global view of the behavior of nmtCP and CF-RNN with different weight functions W, the rolling coverage rate presented in Fig. 10 shows how the coverage rate changes locally. With a window size of 50, it computes the coverage rate over the past 50 examples. Once the distribution shift occurs, nmtCP initially loses validity for all employed weight functions W. When using the exponential weight function or the soft cutoff weight function, the method recovers quickly and remains approximately valid for the remaining examples in the test set. nmtCP achieves this by quickly increasing the width of the prediction regions, as shown in Fig. 11. In combination with the linear weight function and the constant weight function, nmtCP takes much longer to recover and is not able to regain validity consistently after the changepoint. With these weight functions, the mean interval width grows more slowly. Figure 10 also shows that for CF-RNN only a small number of the real labels \(y_i\) lie within the prediction region after the distribution shift; it does not adapt the size of the prediction region over time.

Figures 12 and 13 show the rolling coverage rate and the rolling mean interval width for the two features of the label space. These metrics are aggregated over all time-steps in the prediction horizon \(T_h\). We observe that the method’s behavior for each feature in the label space is similar to its behavior for both features combined. Both per feature and for both features combined, the weight functions that achieve higher coverage rates are the ones that adapt faster and grow the prediction intervals more quickly.

Simulation setting 2: changepoint on one label feature varies the distribution of only one of the label’s features. Figures 14 and 15 show that the method’s general behavior is similar to its behavior in simulation setting 1 with respect to the influence of the chosen weight functions.

Compared to simulation setting 1, the behavior differs when looking at the contributions of the individual features of the label space to the rolling coverage rate and the rolling mean interval width, presented in Figs. 18 and 19. We observe that only \(y_{1:800,2}\), the feature containing the changepoint, displays changes in both metrics. This influences nmtCP’s performance when measured for all dimensions of the label space, as shown in Figs. 16 and 17. The rolling coverage rate and the rolling mean interval width remain stable for \(y_{1:800,1}\), the feature whose distribution does not change.

Similar to simulation setting 1, the weight functions that achieve higher coverage rates are the ones that adapt faster and grow the prediction interval’s size more quickly.

Simulation setting 3: distribution drift on all label features shows how the method behaves when presented with a progressive change in the data’s distribution. Remaining fixed in the first third of the test set, the distribution of both features of the label space drifts linearly during the second third of the test set and is fixed again during the remainder of the test set.

In contrast to the previous settings, nmtCP only remains globally valid when using the exponential weight function. Figure 20 shows that in combination with other weight functions, nmtCP loses its validity for most confidence levels, as does CF-RNN. Figure 21 shows again that higher coverage rates are achieved by increasing the interval widths of the predicted region.

The rolling coverage rate in Fig. 22 shows that the method drops below the targeted coverage rate soon after the distribution drift starts, for all weight functions. nmtCP only manages to recover and stay above the targeted coverage rate with the exponential weight function and the soft cutoff weight function, and only after the distribution is no longer shifting. Figure 23 shows that the mean interval width also continues growing after the distribution is no longer shifting.

Similar to simulation setting 1, we observe that the method’s behavior for each feature in the label space is similar to its behavior overall. Figures 24 and 25 show the rolling coverage rate and the rolling mean interval width for each feature of the label space. Again, the weight functions that achieve higher coverage rates are the ones that adapt faster and grow the prediction intervals more quickly.

Simulation setting 4: distribution drift on one label feature progressively changes the distribution of one of the label’s features. Remaining fixed in the first third of the test set, the distribution of one of the features in the label space drifts linearly during the second third of the test set and is fixed again during the remainder of the test set. The other feature in the label space is not affected by this drift and retains its original distribution throughout the test set. Figures 26 and 27 show that the method’s general behavior is similar to its behavior in simulation setting 3 with respect to the influence of the chosen weight functions.

Looking at the metrics’ evolution over time in Figs. 28 and 29, the situation is also similar to the one encountered in simulation setting 3. For all weight functions, the method drops below the targeted coverage rate soon after the distribution drift starts. nmtCP only manages to recover and stay above the targeted coverage rate with the exponential weight function and the soft cutoff weight function, and only after the distribution is no longer shifting. The rolling mean interval width also continues growing after the distribution is no longer shifting, for all weight functions.

Compared to simulation setting 3, only \(y_{1:800,2}\), the feature experiencing distribution drift, displays changes in the rolling coverage rate and the rolling mean interval width. It is also the only feature for which the mean interval width grows over time. The rolling coverage rate and the rolling mean interval width remain constant for \(y_{1:800,1}\), the feature whose distribution does not change. Figures 30 and 31 show this behavior.

4.4.2 ELEC2

Fig. 32

Mean coverage rate for nmtCP with different weight functions and CF-RNN measured on the ELEC2 data set (Harries, 1999). Average taken over 20 trials

Fig. 33

Mean interval width for nmtCP with different weight functions and CF-RNN measured on the ELEC2 data set (Harries, 1999). Average taken over 20 trials

Fig. 34

Rolling coverage rate for nmtCP with different weight functions and CF-RNN on the test set of the ELEC2 data set (Harries, 1999). Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Table 2 Mean coverage rate for nmtCP with different weight functions and CF-RNN measured on the ELEC2 data set (Harries, 1999). Average taken over 20 trials
Table 3 Mean interval width for nmtCP with different weight functions and CF-RNN measured on the ELEC2 data set (Harries, 1999). Average taken over 20 trials

This subsection presents the results obtained from applying nmtCP to the subset of the ELEC2 data set (Harries, 1999) described in Subsect. 4.1.2. Due to the nature of the data set, this set of experiments showcases how the method behaves in the presence of a strong distribution shift comparable to a changepoint.

In these adverse conditions, nmtCP remains globally valid when using the exponential or the soft cutoff weight function. nmtCP remains almost valid in combination with the linear weight function, but loses validity for most confidence levels when all residuals are given the same weight. Table 2 shows that in all instances, nmtCP outperforms CF-RNN (Stankevičiūtė et al., 2021) by realizing a greater coverage rate on average. This behavior can be observed in Fig. 32, which shows the coverage rate of the prediction regions generated by nmtCP on the test set for different confidence levels \(1-\alpha\). In addition to the results of nmtCP with different weight functions W, it also displays the coverage rate measured for CF-RNN, which does not utilize new residuals when they become available.

Table 3 shows that nmtCP achieves this boost in validity by producing wider prediction regions than CF-RNN (Stankevičiūtė et al., 2021). Figure 33 displays the mean prediction interval width for different confidence levels \(1-\alpha\), corresponding to the coverage rates in Fig. 32.

Both the global coverage rate and the mean interval width are large for small confidence levels \(1 - \alpha\), resulting in all methods being valid for \(1 - \alpha \le 0.2\). The mean interval width increases slowly with the confidence level \(1 - \alpha\).

Tables 2 and 3 also show that a coverage rate of 1 is only achieved through infinite prediction intervals. This behavior cannot be represented in Fig. 33.

While Tables 2 and 3 and Figs. 32 and 33 provide a global view of the behavior of nmtCP with different weight functions W, the rolling coverage rate in Fig. 34 shows how the coverage rate changes locally. With a window size of 50, it computes the coverage rate over the past 50 examples. Once the distribution shift occurs, only nmtCP using the exponential weight function remains valid for a targeted coverage rate of \(1 - \alpha = 0.8\). It does so by continuously producing infinite intervals. In combination with the soft cutoff weight function, the method loses validity briefly but remains approximately valid for the remaining time steps of the test set. In combination with the linear weight function and the constant weight function, nmtCP is not able to regain validity consistently after the changepoint. Figure 34 also shows that for CF-RNN (Stankevičiūtė et al., 2021) none of the real labels \(y_i\) lie within the predicted region after the distribution shift.

4.4.3 Tétouan City

Fig. 35

Mean coverage rate for nmtCP with different weight functions and CF-RNN measured on the Tétouan City data set (Salam & Hibaoui, 2018). Average taken over 20 trials

Fig. 36

Mean interval width for nmtCP with different weight functions and CF-RNN measured on the Tétouan City data set (Salam & Hibaoui, 2018). Average taken over 20 trials

Fig. 37

Rolling coverage rate by feature for nmtCP with different weight functions and CF-RNN on the test set of the Tétouan City data set (Salam & Hibaoui, 2018). Rolling window size \(= 50\), target coverage rate \(1-\alpha = 0.8\), average taken over 20 trials

Table 4 Mean coverage rate for nmtCP with different weight functions and CF-RNN measured on the Tétouan City data set (Salam & Hibaoui, 2018). Average taken over 20 trials
Table 5 Mean interval width for nmtCP with different weight functions and CF-RNN measured on the Tétouan City data set (Salam & Hibaoui, 2018). Average taken over 20 trials

This subsection presents the results obtained from applying nmtCP to the subset of the Tétouan City data set (Salam & Hibaoui, 2018) described in Subsect. 4.1.3. In contrast to the results presented in Subsect. 4.4.2, this set of experiments shows how the method behaves in the presence of a slow distribution shift.

Globally, all models are valid for all confidence levels \(1 - \alpha\). nmtCP with the exponential weight function, the soft cutoff weight function, and the linear weight function are the models with the highest coverage rate, closely followed by CF-RNN (Stankevičiūtė et al., 2021) and nmtCP with constant weights. Table 4 and Fig. 35 show these results. They also show that the difference in coverage rate between the variants of the method is small and that their coverage rate is above the targeted coverage rate, especially towards the lower end of the confidence level \(1 - \alpha\).

While in Subsect. 4.4.2 models with a larger coverage rate also produced larger prediction intervals, this is not the case for this set of experiments. nmtCP paired with the constant weight function has both the lowest coverage rate and the smallest prediction regions. However, CF-RNN has a similar coverage rate while producing the largest prediction regions. Table 5 and Fig. 36 also indicate that for \(1-\alpha = 0.95\) all of them produce infinitely large prediction regions, while nmtCP with the exponential weight function already does so for \(1-\alpha = 0.9\).

The three models with the lowest coverage rate, namely nmtCP with the constant weight function, CF-RNN, and nmtCP with the linear weight function, all temporarily drop below the targeted coverage rate, as shown in Fig. 37. Most of the time, all models reach a coverage rate that is substantially above the targeted coverage rate of \(1-\alpha = 0.8\).

4.5 Discussion

The results presented in Subsects. 4.4.1 and 4.4.2 show that nmtCP is able to quickly adapt to strong distribution shifts such as changepoints, where CF-RNN (Stankevičiūtė et al., 2021) loses its validity. Subsection 4.4.1 also shows that nmtCP takes longer to recover from a local loss of validity when the distribution shifts progressively. Subsection 4.4.3 shows that nmtCP can produce smaller, more efficient prediction regions when both methods are valid. In all conducted experiments, nmtCP retains its global validity when the right weight functions W are chosen. These experiments confirm that nmtCP is generally suited for the multistep-ahead multivariate time-series forecasting task.

The experiments also show that the global validity, the loss of local validity, the speed of recovery of local validity, and the mean interval width all depend on the weight function W that is used. The exponential weight function consistently leads to the highest coverage rates, followed by the soft cutoff weight function. With the chosen parameters, both weight functions place larger weights on recent residuals while discounting older ones heavily. This allows nmtCP to quickly adapt to distribution shifts. The drawback of these weight functions is that, by heavily discounting a large number of old residuals, they reduce the effective sample size of past residuals (Barber et al., 2022), leading to larger steps in the weighted quantile function \(\textsf{Q}\). This contributes to a larger number of uninformative intervals of infinite size for high confidence levels \(1 - \alpha\). The other two weight functions suffer less from this phenomenon. Discounting older residuals less quickly leads to a slower recovery after a local loss of validity, which costs them their global validity in the experiments presented in Subsects. 4.4.1 and 4.4.2. Hence, the choice of the weight function, and thereby the speed and severity with which to discount older residuals, presents the method’s user with a classic bias-variance trade-off.

A notable limitation of nmtCP is its inability to recover from a local loss of validity during a progressive distribution drift, no matter which of the weight functions presented in Subsect. 4.2.1 is chosen. Given the adversarial simulation settings 3 and 4, the quality of the underlying model’s predictions worsens during the distribution drift. This means that the nonconformity scores of a new example will, for the dimensions affected by the distribution drift, on average be larger than their counterparts from previous examples. Ideally, past nonconformity scores of a particular dimension with a magnitude similar to the new one would be assigned larger weights, and those with a dissimilar magnitude would be assigned lower weights, to limit the coverage gap in every dimension of the label space. However, none of the selected weight functions, with their fixed shape, allow for such a dynamic assignment of the weights. This means that, with the weight functions presented in Subsect. 4.2.1, the speed at which the prediction region grows in any dimension of the label space cannot surpass the speed at which the nonconformity scores grow in that dimension. Even assuming there were a mechanism to assign the ideal weights to past nonconformity scores, there might simply not be any nonconformity scores that are large enough. If that is the case for even just one dimension in the label space, it would force this imaginary mechanism to assign the largest possible weight to \(\delta _{r_{+\infty }}\) in that dimension’s quantile function, resulting in a prediction region of infinite size with limited information provided to the method’s user. One possibility to overcome this limitation is to replace the individual nexCPs in nmtCP with an adaptive method such as ACI, which also supports a weighted quantile function.

The proof presented in Subsect. 3.3 guarantees the method’s validity through an upper bound on the miscoverage rate by relying on the Bonferroni correction. It does not, however, provide information about the method’s efficiency. The results of the experiments conducted on the real-world data sets show that nmtCP is often too conservative, producing prediction regions exceeding the targeted coverage rate, especially for low confidence levels \(1 - \alpha\). While (almost) assumption-free validity is the main appeal of conformal prediction methods, inefficient intervals are undesirable as they underestimate the underlying model’s confidence. This conservative behavior of nmtCP is, in the worst case, another factor contributing to a larger number of uninformative intervals of infinite size for high confidence levels \(1 - \alpha\).

To investigate the influence of the correction method for the family-wise error rate on the method’s efficiency and its conservative behavior, we conducted an additional experiment that uses the portion of the ELEC2 data set (Harries, 1999) described in Subsect. 4.1.2. However, instead of comparing different weight functions W for the weighted quantile function \(\textsf{Q}\), it compares different correction methods for the family-wise error rate, sketched in code after the list below. All trials use the soft cutoff weight function presented in Subsect. 4.2.1. The correction methods are

  • The Bonferroni correction shown in Equation (25), replicating the results from Subsect. 4.4.2;

  • The Šidák correction, assuming that the errors occur independently in every dimension of the label space, resulting in \(\alpha ' = 1 - (1-\alpha )^{1 / (T_h \cdot F_h)}\), and;

  • No correction at all with \(\alpha ' = \alpha\).
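A minimal sketch of the three per-dimension levels \(\alpha '\), with m denoting \(T_h \cdot F_h\):

def corrected_alpha(alpha, m, method):
    """Per-dimension miscoverage level alpha' for m = T_h * F_h dimensions."""
    if method == "bonferroni":
        return alpha / m
    if method == "sidak":                  # assumes independent errors
        return 1 - (1 - alpha) ** (1 / m)
    return alpha                           # no correction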

Subsection B.1 in the appendix contains the results for this experiment. Table 6 shows that with no correction at all, nmtCP is not valid, except in the extreme case where the produced interval is of infinite size and therefore uninformative. Using the Šidák correction, nmtCP remains valid on average for all confidence levels \(1 - \alpha\) and compares favorably to the Bonferroni correction by being more efficient. Figure 39a shows that the coverage rate remains closer to the targeted coverage rate, especially for lower confidence levels \(1 - \alpha\), and Fig. 39b confirms that the produced intervals are smaller on average. For \(1 - \alpha = 0.8\), the two methods are remarkably similar, as shown in Fig. 39c. Due to their similarity for high confidence levels \(1 - \alpha\), the use of the Šidák correction does not reduce the generation of uninformative intervals of infinite size.

In addition to being a valid method for generating prediction regions for multistep-ahead multivariate time-series forecasting, nmtCP also provides the user with tools to investigate the contributions of every dimension in the label space to the miscoverage rate and the average interval width. This was already hinted at in Subsect. 4.3. Not only can individual dimensions be scrutinized, but they can also be aggregated. Figures 12, 13, 18, 19, 24, 25, 30, and 31 in Subsect. 4.4.1 are examples of this. They allow for the observation that, in the simulation settings, only the coverage rate and the mean interval width of the features experiencing changes in the distribution are affected by those changes. Figure 40 in the appendix is another example of this, showing the rolling coverage rate for every feature of the results presented in Subsect. 4.4.2. These detailed insights are easily obtained for nmtCP because the prediction intervals are individually generated and expressed for every dimension in the label space and only aggregated into a common prediction region after their generation. This transparency comes at the cost of depriving “the conformal model of profiting from the possible correlations between the different outputs”, as Messoudi et al. (2020) note.

5 Conclusion

This article examined the task of conformal multistep-ahead multivariate time-series forecasting. It investigated the difficulties associated with applying conformal prediction to non-exchangeable data and multi-target regression. The proposed method, nmtCP, produces prediction regions for multistep-ahead multivariate time-series forecasting tasks. nmtCP is computationally efficient and relatively easy to implement, requiring no modifications to the underlying model. A theoretical analysis showed that the method has desirable properties by proving that its miscoverage rate has an upper bound that depends on the chosen hyperparameters. Because this proof does not rely on the exchangeability of the examples in the data set or a symmetric model-fitting algorithm for the underlying model, the method is suitable for time-series applications. In addition, the method’s architecture allows for an easy investigation of its behavior, because the metrics used to evaluate the method can be applied to each dimension in the label space individually or aggregated over multiple dimensions.

The experimental results on four synthetic and two public data sets from the electricity domain serve to validate nmtCP’s theoretical properties and provide insights into the method’s behavior. They show that even in adverse conditions the method retains its validity when paired with the right parameters. It recovers quickly after a strong distribution shift and is more efficient than a competing method when both are valid.

The experiments also show that the method is often too conservative, especially for small confidence levels, producing prediction intervals that are larger than they need to be. Another limitation of nmtCP revealed by the experiments is the method’s inability, given the chosen weights on the nonconformity scores, to grow the prediction region quickly enough for the duration of a distribution drift.

5.1 Future work

We have identified the following key areas for future research: addressing the method’s overly conservative behavior, providing practical guidance for tuning the weighted quantile function’s weights, expanding support to quantile regression models, and developing a benchmarking platform. This section elaborates on these areas, proposes possible next steps, and highlights related research for each. Addressing the method’s overly conservative behavior is a pressing next step. Integrating the work of Messoudi et al. (2021) offers a promising opportunity to do so by leveraging the correlations between the different dimensions in the label space using copulas. In a subsequent publication, Messoudi et al. (2022) present a method that produces a multidimensional ellipsoidal uncertainty region instead of a hyper-rectangle. Another possible approach to addressing the method’s overly conservative behavior is building upon Ajroldi et al. (2023) and Diquigiovanni et al. (2024). If the values for multiple dimensions in the label space are sampled from the same continuous phenomenon over a spatial or temporal domain, Ajroldi et al. (2023) and Diquigiovanni et al. (2024) take a functional time-series approach and generate prediction bands for functional data. This eliminates the reliance on family-wise error rate correction methods at the cost of predicting functions instead of matrices.

While the weights w of the weighted quantile function \(\textsf{Q}\) offer the user a lot of flexibility to tune nmtCP, we provide no concrete advice on how to exploit this opportunity. Based on the suggestion made by Barber et al. (2022), an intuitive interpretation of the task assumes that newer residuals should carry a higher weight. Additional experiments, such as the ones presented by Guan (2023) and Hore and Barber (2023), are required to offer more practical guidance to the user on how to set the weights.

In its current form, nmtCP only supports underlying models that produce point forecasts. It might be of interest to explore whether the advances made for quantile regression in (Barber et al., 2022; Gibbs & Candès, 2021) can be applied to nmtCP. This would add support for underlying quantile regression models.

Finally, to test and compare nmtCP and future methods, it would be useful to construct a set of synthetic benchmarks that simulate different scenarios, such as changepoints or slow distribution shifts occurring in all or a subset of the features. These benchmarks should ideally come with a set of metrics that allow for an objective and comparable quantification of a method’s properties, such as the time it takes for the method to recover after falling below the targeted coverage rate. They should also go beyond the adversarial approach taken in this article and put greater emphasis on the information efficiency of a tested method. Applying nmtCP to a larger set of real-world data sets, such as the one compiled by Messoudi et al. (2020), might also help to further understand the method’s behavior and provide the user with practical guidance.