1 Introduction

Long-term online prediction is an essential requirement in practical applications such as power generation [1], traffic flow [2], air quality sensing [3], and temperature prediction [4]. Modeling such complex systems requires a large number of parameters to be determined, and the interactive learning process usually incurs an intolerable computational overhead. In a long-term online learning process, the neural network cannot always be rebuilt through trial and error driven by user intervention. Therefore, how to adjust the model and its parameters according to the arriving data has become a hot research topic [5].

For regression problems, the key objective is fitting. A single training model that is best for all datasets does not exist, and fitting deviation cannot be avoided [6]. Compared with underfitting, overfitting is harder to handle, because its degree can only be evaluated at the test stage [7]. A penalty constraint can be added to the candidate model to prevent overfitting [8]. Owing to its asymptotic convergence property, AIC is frequently used for small samples [9]. In the online scenario, data arrive one by one or chunk by chunk. The regularity of the input characteristics may change over time, and gradient-descent-based learning degrades the predictive efficiency. As a single-hidden-layer feedforward network (SLFN), the Extreme Learning Machine (ELM) needs no iterations and is widely used for classification and regression problems [10]. A dynamic network architecture extends the original ELM learning strategy to make it suitable for online sequential learning (OS-ELM) [11]. Several ELM variants increase the number of hidden nodes as data arrive; they offer not only low computational overhead but also remarkable nonlinear mapping capability [12]. To avoid enlarging the network without restriction, the incremental ELM (I-ELM) sets a termination condition based on the expected learning accuracy and a maximum hidden node number [13]. The convex I-ELM (CI-ELM) [14] achieves a faster convergence rate by adjusting the output weights, and the error-minimized ELM (EM-ELM) [15] approaches the prediction target by adding hidden nodes in chunks of varying length. To make the networks more compact, the adaptive ELM [16] and the sparse ELM [17] have been proposed. These pruning methods generate succinct links between the nodes of different layers, but they inevitably bear a high computational complexity due to the estimation of probability densities [18].

So far, the most representative dynamic ELM for long-term online prediction is D-ELM [19], which combines EM-ELM and AG-ELM [20]. The method has a strong advantage in data fitting when the target function is not explicit. To optimize multiple parameters, some researchers introduce biologically inspired optimization methods to acquire the best network topologies [21,22,23], but these batch learning processes are not suitable for online data streams, especially in long-term prediction. Han et al. [24] present a dynamic ELM based on particle swarm optimization (PSO) [25] to obtain appropriate network parameters, in which the training error and the condition performance are considered simultaneously. Although the above online forecasting methods can find locally optimal models, they all depend on predefined settings, including the maximum hidden node number and the ultimate number of iteration steps. If overfitting and underfitting cannot be eliminated in time, the generalization performance goes from bad to worse, and the long-term prediction model eventually fails.

In this paper, a novel dynamic ELM with balanced variance and bias (VB-ELM) is presented. Extending the traditional OS-ELM, the proposed method acquires beneficial parameters quantitatively. Relying on the degree of overfitting or underfitting, the VB-ELM structure is optimized automatically. To further reduce the approximation error, spatial-temporal features are considered in the update process. Since both model selection and constrained optimization affect the fitting degree, the proposed method deals with the hidden node number and the regularization parameter simultaneously. The whole forecasting process runs without manual trial-and-error operations, which is of significant application value for long-term online prediction.

The remainder of this paper is organized as follows. In Sect. 2, we describe the preliminaries involving EM-ELM and approximation error theory. In Sect. 3, a new ELM using balanced variance and bias is introduced. Experimental results and discussion are presented in Sect. 4. Finally, the conclusion is given in the last section.

2 Preliminaries

ELM has low computational complexity and allows easy hardware implementation [26], which has motivated researchers to develop improved ELMs for special applications, especially big data processing and high-dimensional prediction [27]. In this section, the preliminaries of EM-ELM and the relationship between the approximation error and the hidden node number are described.

2.1 Brief on EM-ELM

EM-ELM is a modified ELM that adds hidden nodes in chunks of different sizes. As the network grows, the output weights are updated so as to minimize the error [28]. Suppose that the sample data are \(\{ (\mathbf{{x}}_i ,y_i )\} _{i = 1}^N\); the basic mapping relation can then be expressed as

$$\begin{aligned} y_i = \sum \limits _{p = 1}^L {\beta _p \phi (\mathbf{{a}}_p \cdot \mathbf{{x}}_i + b_p )},\quad i = 1, \ldots ,N \end{aligned}$$
(1)

where \(\phi (\cdot )\) is the activation function, L is the number of hidden nodes, and \(\mathbf{{a}}_p\) and \(b_p\) are the input weight vector and the bias of the pth hidden node, respectively. The only remaining task in determining the network is to compute the output weights \({\varvec{\beta }}\). Therefore, Eq. (1) can be rewritten more concisely as

$$\begin{aligned} \mathbf{{Y}} = \mathbf{{H}}{\varvec{\beta }} \end{aligned}$$
(2)

where \(\mathbf{{H}} = \left[ {\begin{array}{lll} {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_1 + b_1 )} &{} \ldots &{} {\phi (\mathbf{{a}}_L \cdot \mathbf{{x}}_1 + b_L )} \\ \vdots &{} \ldots &{} \vdots \\ {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_N + b_1 )} &{} \ldots &{} {\phi (\mathbf{{a}}_L \cdot \mathbf{{x}}_N + b_L )} \\ \end{array}} \right] _{N \times L}\), \({\varvec{\beta }} = [\beta _1 ,\beta _2 ,\ldots ,\beta _L ]^T\) and \(\mathbf{{Y}} = \left[ {y_1 ,y_2 ,\ldots ,y_N } \right] ^T\).

Here \(\mathbf{{H}}\) denotes the hidden-layer output matrix of the network. By minimizing the approximation error, the output weights can be determined as follows

$$\begin{aligned} {\varvec{\beta }} = \mathbf{{H}}^\dag \mathbf{{Y}} \end{aligned}$$
(3)

where \(\mathbf{{H}}^{\dag }\) is the Moore–Penrose generalized inverse of \(\mathbf{{H}}\), which is

$$\begin{aligned} \mathbf{{H}}^{\dag } = (\mathbf{{H}}^{\mathrm{T}} \mathbf{{H}})^{ - 1} \mathbf{{H}}^T, \end{aligned}$$
(4)
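For concreteness, a minimal Python/NumPy sketch of the basic ELM training step in Eqs. (1)-(4) is given below; the sigmoid activation, the random initialization scheme, and the function names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def elm_fit(X, Y, L, seed=0):
    """Train a basic ELM: random hidden layer, output weights via Eq. (3)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.standard_normal((d, L))           # input weight vectors a_p
    b = rng.standard_normal(L)                # hidden biases b_p
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # hidden output matrix, Eq. (1) (sigmoid assumed)
    beta = np.linalg.pinv(H) @ Y              # beta = H^dagger Y, Eqs. (3)-(4)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                           # Y = H beta, Eq. (2)
```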

For EM-ELM, the target error \(\varepsilon \) and the initial hidden node number \(L_0\) are predefined. First, the output weight matrix is calculated by \({\varvec{\beta }}_1 = \mathbf{{H}}_{1}^{\dag } \mathbf{{Y}}\). If the approximation error is greater than \(\varepsilon \), new hidden nodes are added. Let \(\delta \mathbf{{H}}_1\) denote the hidden-layer output matrix of the newly added nodes; the generalized inverse is then calculated as

$$\begin{aligned} \mathbf{{H}}_2^{\dag } = \left( \mathbf{{H}}_2^{\mathrm{T}} \mathbf{{H}}_2\right) ^{ - 1} \mathbf{{H}}_2^{\mathrm{T}} = \left( {\left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] ~\left[ {\mathbf{{H}}_1 ,\delta \mathbf{{H}}_1 } \right] } \right) ^{ - 1} \left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] , \end{aligned}$$
(5)

By substituting the matrix \( \mathbf{{A}} = \left[ {\begin{array}{cc} {\mathbf{{A}}_{11} } &{}\quad {\mathbf{{A}}_{12} } \\ {\mathbf{{A}}_{21} } &{}\quad {\mathbf{{A}}_{22} } \\ \end{array}} \right] = \left( {\left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}}\right] ~\left[ {\mathbf{{H}}_1 ,\delta \mathbf{{H}}_1 } \right] } \right) ^{ - 1}\), \(\mathbf{{H}}_2^\dag \) can be expressed as the following block matrix

$$\begin{aligned} \mathbf{{H}}_2^{\dag } = \left[ {\begin{array}{cc} {\mathbf{{A}}_{11} } &{}\quad {\mathbf{{A}}_{12} } \\ {\mathbf{{A}}_{21} } &{}\quad {\mathbf{{A}}_{22} } \\ \end{array}} \right] \left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] = \left[ {\begin{array}{c} {\mathbf{{U}}_1 } \\ {\mathbf{{D}}_1 } \\ \end{array}} \right] , \end{aligned}$$
(6)

Based on the Schur complement, the two blocks of \(\mathbf{{H}}_2^{\dag }\) can be written as

$$\begin{aligned} \mathbf{{D}}_1= & {} \left( {\left( {\mathbf{{I}} - \mathbf{{H}}_1 \mathbf{{H}}_1^{\dag } } \right) \delta \mathbf{{H}}_1 } \right) ^{\dag }, \end{aligned}$$
(7)
$$\begin{aligned} \mathbf{{U}}_1= & {} \mathbf{{H}}_1^\dag \left( {\mathbf{{I}} - \delta \mathbf{{H}}_1 \mathbf{{D}}_1 } \right) . \end{aligned}$$
(8)

The output weight can be calculated as

$$\begin{aligned} {\varvec{\beta }}_2 = \mathbf{{H}}_2^{\dag } \mathbf{{Y}} = \left[ {\begin{array}{c} {\mathbf{{U}}_1 } \\ {\mathbf{{D}}_1 } \\ \end{array}} \right] \mathbf{{Y}}. \end{aligned}$$
(9)

Similarly, the \((k+1)\)th output weights are updated as follows

$$\begin{aligned} {\varvec{\beta }}_{k + 1} = \mathbf{{H}}_{k + 1}^\dag \mathbf{{Y}} = \left[ {\begin{array}{c} {\mathbf{{U}}_k } \\ {\mathbf{{D}}_k } \\ \end{array}} \right] \mathbf{{Y}}, \end{aligned}$$
(10)

where \(\mathbf{{D}}_k = \left( {\left( {\mathbf{{I}} - \mathbf{{H}}_k \mathbf{{H}}_k^\dag } \right) \delta \mathbf{{H}}_k } \right) ^\dag \) and \(\mathbf{{U}}_k = \mathbf{{H}}_k^\dag \left( {\mathbf{{I}} - \delta \mathbf{{H}}_k \mathbf{{D}}_k } \right) \). The activation function can take various forms, such as the sigmoid function and the Gaussian function, all of which make ELM well adapted to different tasks.
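A sketch of one EM-ELM growth step is given below; it updates the generalized inverse and the output weights by the block formulas of Eqs. (7)-(10) instead of recomputing the full pseudoinverse. The function signature and variable names are our own assumptions.

```python
import numpy as np

def em_elm_grow(H_k, H_pinv_k, delta_H, Y):
    """One EM-ELM growth step: update H^dagger and beta by Eqs. (7)-(10)."""
    N = H_k.shape[0]
    I = np.eye(N)
    D = np.linalg.pinv((I - H_k @ H_pinv_k) @ delta_H)   # D_k, Eq. (7)
    U = H_pinv_k @ (I - delta_H @ D)                     # U_k, Eq. (8)
    H_new = np.hstack([H_k, delta_H])                    # grown hidden output matrix
    H_pinv_new = np.vstack([U, D])                       # H_{k+1}^dagger, Eq. (6)
    beta_new = H_pinv_new @ Y                            # beta_{k+1}, Eq. (10)
    return H_new, H_pinv_new, beta_new
```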

2.2 Approximation Error Theory

ELM computes the output weights analytically. As the number of hidden nodes grows, the training error keeps decreasing. Under the original ELM framework, the learning target is to minimize the norm of the approximation error as follows

$$\begin{aligned} { error} = {\mathop {\min }\limits _{\varvec{\beta }}} \left\| \mathbf{{H}}{\varvec{\beta }} - \mathbf{{Y}} \right\| , \end{aligned}$$
(11)

Because \({\varvec{\beta }}^*\) is calculated by the Moore–Penrose generalized inverse, it is the minimum-norm least-squares solution, namely \(\left\| {{\varvec{\beta }}^*} \right\| _n \le \left\| {\varvec{\beta }} \right\| _n\) for every least-squares solution \({\varvec{\beta }}\) [29]. The set of least-squares solutions can be written as follows

$$\begin{aligned} \forall {\varvec{\beta }} \in \left\{ {\varvec{\beta }}:\left\| \mathbf{{H}}{\varvec{\beta }} - \mathbf{{Y}} \right\| _m \le \left\| {\mathbf{{Hz}} - \mathbf{{Y}}} \right\| _m ,\forall \mathbf{{Y}} \in R^m,\mathbf{{z}} \in R^n \right\} , \end{aligned}$$
(12)

where \(\left\| \cdot \right\| _n\) and \(\left\| \cdot \right\| _m\) are the norms on \(R^n\) and \(R^m\), respectively, and \(\left\{ \cdot \right\} \) is the set of least-squares solutions of \(\mathbf{{H}}{\varvec{\beta }} =\mathbf{{Y}}\). When the hidden node number grows from \(L_1\) to \(L_2\), the new error is smaller than or equal to that of the previous ELM, as shown below.

$$\begin{aligned} { error}_2= & {} \left\| {\mathbf{{H}}_{L_2 } {\varvec{\beta }}_{L_2 } - \mathbf{{Y}}} \right\| = \left\| {\left[ {\mathbf{{H}}_{L_1 } ,\mathbf{{H}}_{L_2 - L_1 } } \right] {\varvec{\beta }}_{L_2 } - \mathbf{{Y}}} \right\| \nonumber \\\le & {} \left\| {\left[ {\mathbf{{H}}_{L_1 } ,\mathbf{{H}}_{L_2 - L_1 } } \right] \left[ {\begin{array}{c} {\varvec{\beta }}_{L_1 } \\ 0 \\ \end{array}} \right] - {\mathbf{Y }}} \right\| = \left\| \mathbf{{H}}_{L_1} {\varvec{\beta }}_{L_1 } - \mathbf{{Y}} \right\| = { error}_1. \end{aligned}$$
(13)

Because \({\varvec{\beta }}_{L_2}\) is the Moore–Penrose solution, it has the minimum norm, whereas \(\left[ {{\varvec{\beta }}_{L_1 } ,0} \right] ^{\mathrm{T}}\) may not be the minimum-norm least-squares solution; hence the inequality in Eq. (13) holds. As the hidden node number grows, the ever-decreasing approximation error eventually leads to overfitting, so the learning target must be reconsidered.
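The non-increasing property of Eq. (13) can be checked numerically with the toy example below; the data and the hidden-node pool are assumptions. Drawing the first L columns of a fixed node pool guarantees that the smaller hidden-layer matrix is a sub-block of the larger one, as required by the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, L_max = 200, 3, 40
X = rng.uniform(-1, 1, (N, d))
Y = np.sin(X).sum(axis=1) + 0.05 * rng.standard_normal(N)   # toy target (assumption)

# Fixed pool of hidden nodes; using the first L columns makes H_{L1} a sub-block of H_{L2}.
A = rng.standard_normal((d, L_max))
b = rng.standard_normal(L_max)
H_full = 1.0 / (1.0 + np.exp(-(X @ A + b)))

for L in (5, 10, 20, 40):
    H = H_full[:, :L]
    beta = np.linalg.pinv(H) @ Y
    print(L, np.linalg.norm(H @ beta - Y))   # training error is non-increasing in L, as in Eq. (13)
```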

3 The Proposed Method

In this section, we introduce the proposed VB-ELM in detail. The algorithm includes two phases: initialization learning and model update. In the initialization learning phase, only a small portion of the training data is used. In the model update phase, the hidden node number and the regularization parameter are adjusted according to the fitting degree, and the model parameters are confirmed according to the arriving data.

3.1 Initialization Learning Phase

First of all, it is necessary to obtain the original model. Considering the initial \(t_0\) sample chunks \({ Set}_0 = \{(\mathbf{{x}}_i,y_i )\} _{i = 1}^{t_0 \times M}\), \(t_0\)-fold cross validation is employed in the training process. The whole training set is divided into \(t_{0}\) parts; each part is selected as the test dataset in turn, and the remaining \(t_{0}-1\) parts form the corresponding training set. \(C_0\) is the regularization parameter, which is confirmed according to the error of the \(t_0\)-fold cross validation. Mathematically, for the first fold, the fitting problem becomes the following optimization process

$$\begin{aligned} \begin{array}{l} { Minimize}:{ error}_0 = \frac{1}{2}\left\| {{\varvec{\beta }}_0 } \right\| + C_0 \frac{1}{2}\sum \limits _{i = 1}^{{\left( {t_0 - 1} \right) } \times M} {\left\| {\xi _i } \right\| ^2 } \\ {Subject\,to}:~\mathbf{{h}}\left( {\mathbf{{x}}_i } \right) {\varvec{\beta }}_0 = y_i^{\mathrm{T}} - \xi _i^{\mathrm{T}} ,i = 1, \ldots ,\left( {t_0 - 1} \right) \times M\\ \end{array}, \end{aligned}$$
(14)

where M is the length of each sample chunk, \({\varvec{\beta }}_0\) is the output weight vector, and \(\mathbf{{h}}\left( {\mathbf{{x}}_i } \right) \) is the hidden-layer mapping vector. Moreover, \(y_i\) is the expected output and \(\xi _i\) is the prediction error. According to the KKT theorem [30], the solution is given as follows

$$\begin{aligned} {\varvec{\beta }}_0 = \mathbf{{H}}_0^\dag \mathbf{{Y}}_0 = \left\{ {\begin{array}{ll} \mathbf{{H}}_0^{\mathrm{T}} \left( {\frac{\mathbf{{I}}}{{C_0 }} + \mathbf{{H}}_0 \mathbf{{H}}_0^{\mathrm{T}} } \right) ^{ - 1} \mathbf{{Y}}_0,\quad { when}~{{\left( {t_0 - 1} \right) } \times M} < L_0 \\ \left( {\frac{\mathbf{{I}}}{{C_0 }} + \mathbf{{H}}_0^{\mathrm{T}} \mathbf{{H}}_0 } \right) ^{ - 1} \mathbf{{H}}_0^{\mathrm{T}} \mathbf{{Y}}_0 ,\quad when\;{{\left( {t_0 - 1} \right) } \times M} \ge L_0 \\ \end{array}} \right. , \end{aligned}$$
(15)

where \(L_0\) is the hidden node number at initial learning step, and

$$\begin{aligned} \mathbf{{H}}_0 = \left[ {\begin{array}{ccc} {\phi (\mathbf{{a}}_1\cdot \mathbf{{x}}_1 + b_1 )} &{}\quad \ldots &{}\quad {\phi (\mathbf{{a}}_{L_0 } \cdot \mathbf{{x}}_1 + b_{L_0 } )} \\ \vdots &{}\quad \ldots &{}\quad \vdots \\ {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_{{\left( {t_0 - 1} \right) } \times M} + b_1 )} &{}\quad \ldots &{}\quad {\phi (\mathbf{{a}}_{L_0 } \cdot \mathbf{{x}}_{{\left( {t_0 - 1} \right) } \times M} + b_{L_0 } )} \\ \end{array}} \right] _{({{\left( {t_0 - 1} \right) } \times M}) \times L_0 }. \end{aligned}$$
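A minimal sketch of the regularized solution in Eq. (15) is given below; it simply switches between the two branches according to whether the number of training samples exceeds the hidden node number. The function name is illustrative.

```python
import numpy as np

def regularized_beta(H, Y, C):
    """Output weights of Eq. (15): invert the smaller Gram matrix."""
    n_samples, L = H.shape
    if n_samples < L:
        # beta = H^T (I/C + H H^T)^{-1} Y
        return H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, Y)
    # beta = (I/C + H^T H)^{-1} H^T Y
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)
```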

Comparing the errors of the \(t_{0}\)-fold cross validation, the \({\bar{\varvec{\beta }}}_0\) with the minimum error is chosen as the forecast model, and the average validation error serves as a tolerance threshold. Considering the test sample chunk \(\mathbf{{X}}_1\), the generalization error can be decomposed as [31]

$$\begin{aligned} { error}_1 = { Bias}_1^2 (\mathbf{{X}}_1 ) + { Var}_1 (\mathbf{{X}}_1 ) + \varepsilon ^2, \end{aligned}$$
(16)

where \({ Bias}_1^2 (\mathbf{{X}}_1 )\) is the squared prediction bias, \({ Var}_1 (\mathbf{{X}}_1 )\) is the variance, and \(\varepsilon \) represents the noise. Usually \(\varepsilon \) is Gaussian white noise, so its mean can be regarded as zero. After the test processing, the bias and variance in the initialization learning phase are given by

$$\begin{aligned} { Bias}_1 (\mathbf{{X}}_1 )= & {} \mathbf{{H}}_1 {\bar{\varvec{\beta }}}_0 - \mathbf{{Y}}_1, \end{aligned}$$
(17)
$$\begin{aligned} { Var}_1 (\mathbf{{X}}_1 )= & {} \frac{1}{{t_1 }}\sum _{v=1}^{t_1} \left( \mathbf{{H}}_1^v {\varvec{\beta }}_1^v - \mathbf{{H}}_1 {\bar{\varvec{\beta }}}_0 \right) ^2 , \end{aligned}$$
(18)

where \(\mathbf{{H}}_1^v {\varvec{\beta }}_1^v\) is the predicted value of the vth fold. When the (\(t_0+1\))th sample chunk arrives, the training data are reorganized into \(t_1\) folds.
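As an illustration, the bias and variance terms of Eqs. (17)-(18) can be estimated from the fold models as in the sketch below. It assumes, consistent with the cross-validation step above, that all fold models share the same random hidden nodes and differ only in their output weights; the function and argument names are our own.

```python
import numpy as np

def bias_variance(H_test, Y_test, beta_bar, beta_folds):
    """Estimate Bias (Eq. (17)) and Var (Eq. (18)) on a test chunk.

    beta_bar   : output weights of the selected (minimum-error) model
    beta_folds : list of output weights trained on each validation fold
    """
    mean_pred = H_test @ beta_bar                               # H_1 * beta_bar_0
    bias = mean_pred - Y_test                                   # Eq. (17)
    var = np.mean([(H_test @ b - mean_pred) ** 2                # Eq. (18), folds share hidden nodes
                   for b in beta_folds], axis=0)
    return np.linalg.norm(bias), np.linalg.norm(var)
```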

3.2 Model Update Phase

In the model update phase, the hyper-parameters are determined according to the prior chunk. The change in bias or variance determines the fitting deviation [8]. In general, bias and variance move in opposite directions, which is known as the bias-variance dilemma. The relationship among generalization error, bias, and variance is shown in Fig. 1 [32].

Fig. 1 The relationship of generalization error, bias and variance

From Fig. 1, we can see that bias is the dominant component of the generalization error when the fitting ability is insufficient. Overfitting or underfitting can then be judged as follows

$$\begin{aligned} { Var}_k - { Bias}_k^2 \mathop \gtrless \limits _{H_2}^{H_1}{} { Var}_{k - 1}-{ Bias}_{k - 1}^2, \end{aligned}$$
(19)

where \({ Bias}_k = \left\| {{ Bias}_k (\mathbf{{X}}_k )} \right\| \) and \({ Var}_k = \left\| {{ Var}_k (\mathbf{{X}}_k )} \right\| \) are the mean values over the kth test chunk, and \({ Bias}_{k - 1}\) and \({ Var}_{k - 1}\) are the mean bias and variance of the \((k-1)\)th test chunk, respectively. If event \(H_1\) holds, overfitting occurs; otherwise, if event \(H_2\) holds, underfitting exists. Following [33], the fitting degree p is defined as

$$\begin{aligned} p = \frac{{ Bias}_{k,\max }^v-{ Bias}_k}{{\hat{\gamma }}_{k}- { Bias}_k}, \end{aligned}$$
(20)

where \({ Bias}_{k,\max }^v\) is the maximum error over the validation folds, \({ Bias}_k\) is the test error on the kth chunk, and \({\hat{\gamma }}_{k}\) is the no-information error rate of the current phase. \({ Bias}_{k,\max }^v\) is calculated as follows

$$\begin{aligned} { Bias}_{k,\max }^v = {\mathop {\max }\limits _{t_k}} \left\| \frac{1}{t_k}\left( \mathbf{{H}}_k^v {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_k^v \right) \right\| . \end{aligned}$$
(21)

\({\hat{\gamma }}_{k}\) describes the no-information error rate on the known dataset, which is estimated as

$$\begin{aligned} {\hat{\gamma }} _k = \left\| {\frac{1}{{t_k^2 }}\sum \limits _{i = 1}^{t_k } {\sum \limits _{j = 1,j \ne i}^{t_k } {(\mathbf{{H}}_j {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_i)}}}\right\| . \end{aligned}$$
(22)

The range of p is obviously from zero to one. When the value is zero, the model and the new data fit each other well; when p approaches one, the model deviates severely from the new data. On the basis of p, the regularization parameter threshold for the (\(k+1\))th chunk is moved as follows

$$\begin{aligned} C_{k + 1}^{{ bound}} = \left\{ {\begin{array}{ll} \left\lfloor {C_k \cdot (1 - p)} \right\rfloor ,&{}\quad { overfitting} \\ \left\lceil {C_k \cdot (1 + p)} \right\rceil ,&{}\quad { underfitting} \\ \end{array}} \right. , \end{aligned}$$
(23)

where \(C_k\) is the current value, and \(\left\lfloor \cdot \right\rfloor \) and \(\left\lceil \cdot \right\rceil \) denote rounding down and up to the nearest integer, respectively. The threshold of the hidden node number is changed analogously:

$$\begin{aligned} L_{k + 1}^{{ bound}} = \left\{ {\begin{array}{ll} \left\lfloor {L_k \cdot (1 - p)} \right\rfloor ,&{}\quad { overfitting} \\ \left\lceil {L_k \cdot (1 + p)} \right\rceil ,&{}\quad { underfitting} \\ \end{array}} \right. , \end{aligned}$$
(24)

where \(L_k\) is the current threshold value. The joint optimal solution is then obtained by the PSO method as follows

$$\begin{aligned} V_{L_i ,C_j }^{n + 1}= & {} V_{L_i ,C_j }^n + c_1 \times r_1 \times \left[ {B_{L_i ,C_j }^n - P_{L_i ,C_j }^n } \right] + c_2 \times r_2 \times \left[ {B_{L_g ,C_j } - P_{L_i ,C_j }^n } \right] ,\end{aligned}$$
(25)
$$\begin{aligned} P_{L_i ,C_j }^{n + 1}= & {} P_{L_i ,C_j }^n + V_{L_i ,C_j }^{n + 1} ,\quad L_{\min } \le L_i \le L_{\max } ,C_{\min } \le C_j \le C_{\max }, \end{aligned}$$
(26)

where V and P express velocity and position, n is the iteration number, \(B_{L_i ,C_j }^n\) is the personal best particle at iteration n, and \(B_{L_g ,C_j }\) is the group-optimal location. \(L_{\min } = \min \left\{ {L_{k + 1}^{{ bound}} ,L_k } \right\} \), \(L_{\max } = \max \left\{ {L_{k + 1}^{{ bound}} ,L_k } \right\} \), \(C_{\min } = \min \left\{ {C_{k + 1}^{{ bound}} ,C_k } \right\} \), and \(C_{\max } = \max \left\{ {C_{k + 1}^{{ bound}} ,C_k } \right\} \). \(r_1\) and \(r_2\) are uniform random numbers in \(\left[ {0,1} \right] \), and \(c_1\) and \(c_2\) are positive constants. The fitness function is calculated as follows

$$\begin{aligned} { fitness} = \left\| {\mathbf{{H}}_k {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_k } \right\| . \end{aligned}$$
(27)

The best parameter group is updated as follows

$$\begin{aligned} B_{L_i ,C_j }^{n + 1}= & {} \left\{ \begin{array}{ll} P_{L_i ,C_j }^n,&{}\quad { if}\quad { fitness}\left( {P_{L_i ,C_j }^n } \right) < { fitness}\left( {B_{L_i ,C_j }^n } \right) \\ B_{L_i ,C_j }^n,&{}\quad { else} \\ \end{array} \right. ,\end{aligned}$$
(28)
$$\begin{aligned} B_{L_g ,C_j}= & {} \min \left\{ {B_{L_{\min } ,C_j }^{n + 1} ,B_{L_i ,C_j }^{n + 1} , \ldots ,B_{L_{\max } ,C_j }^{n + 1} } \right\} . \end{aligned}$$
(29)

After the optimal fitness is acquired, the corresponding particle coordinates are \(L_i ^\prime \) and \(C_i ^\prime \), so we set \(L_{k + 1} = L_i ^\prime \) and \(C_{k + 1} = C_i ^\prime \). Finally, the solution is described as follows

$$\begin{aligned} {\varvec{\beta }}_{k + 1} = \mathbf{{H}}_{k + 1}^\dag \mathbf{{Y}}_{k + 1} = \left\{ {\begin{array}{ll} \mathbf{{H}}_{k + 1}^{\mathrm{T}} \left( {\frac{\mathbf{{I}}}{{C_{k + 1} }} + \mathbf{{H}}_{k + 1} \mathbf{{H}}_{k + 1}^{\mathrm{T}} } \right) ^{ - 1} \mathbf{{Y}}_{k + 1} ,&{}\quad { when}~{\left( {t_{k + 1} \times M} \right) } < L_{k + 1} \\ \left( {\frac{\mathbf{{I}}}{{C_{k + 1} }} + \mathbf{{H}}_{k + 1}^{\mathrm{T}} \mathbf{{H}}_{k + 1} } \right) ^{ - 1} \mathbf{{H}}_{k + 1}^{\mathrm{T}} \mathbf{{Y}}_{k + 1} ,&{}\quad { when}~\left( {t_{k + 1} \times M} \right) \ge L_{k + 1} \\ \end{array}} \right. . \end{aligned}$$
(30)
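To make the update of the fitting degree and the parameter thresholds concrete, the following is a small sketch of Eqs. (19)-(24). The clipping of p to [0, 1] and all function names are our own assumptions.

```python
import math

def fitting_degree(bias_k, bias_k_max, gamma_hat):
    """Fitting degree p of Eq. (20), clipped to [0, 1] (clipping is an assumption)."""
    p = (bias_k_max - bias_k) / (gamma_hat - bias_k)
    return min(max(p, 0.0), 1.0)

def update_bounds(bias_k, var_k, bias_prev, var_prev, p, C_k, L_k):
    """Detect over/underfitting by Eq. (19) and move the thresholds by Eqs. (23)-(24)."""
    overfitting = (var_k - bias_k ** 2) > (var_prev - bias_prev ** 2)   # event H1, Eq. (19)
    if overfitting:
        C_bound = math.floor(C_k * (1 - p))                             # Eq. (23), overfitting
        L_bound = math.floor(L_k * (1 - p))                             # Eq. (24), overfitting
    else:                                                               # event H2: underfitting
        C_bound = math.ceil(C_k * (1 + p))
        L_bound = math.ceil(L_k * (1 + p))
    return overfitting, C_bound, L_bound
```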

The method alters the parameters according to the fitting degree, and the hidden nodes need not be tuned one by one, so VB-ELM is better suited to long-term online applications. Taking the characteristics of the learning procedure into account, the complete VB-ELM algorithm is presented in Algorithm 1.

Algorithm 1 The complete VB-ELM algorithm
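Since Algorithm 1 is presented as a figure, a minimal Python sketch of the joint (L, C) search of Eqs. (25)-(29) is given below as an illustration. The swarm size, iteration count, position clipping, and the fitness callback are assumptions; fitness(L, C) is expected to train a candidate model with int(round(L)) hidden nodes and return the norm of Eq. (27). The returned pair is then used as \(L_{k+1}\) and \(C_{k+1}\) in Eq. (30).

```python
import numpy as np

def pso_select(fitness, L_range, C_range, n_particles=20, n_iter=30,
               c1=2.0, c2=2.0, seed=0):
    """Jointly search (L, C) within the bounds of Eqs. (23)-(24) using the
    PSO updates of Eqs. (25)-(26) and best-particle updates of Eqs. (28)-(29)."""
    rng = np.random.default_rng(seed)
    lo = np.array([L_range[0], C_range[0]], dtype=float)
    hi = np.array([L_range[1], C_range[1]], dtype=float)
    P = rng.uniform(lo, hi, (n_particles, 2))          # particle positions (L, C)
    V = np.zeros_like(P)                               # particle velocities
    B = P.copy()                                       # personal bests
    B_fit = np.array([fitness(*p) for p in P])
    g = B[np.argmin(B_fit)].copy()                     # group best, Eq. (29)

    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        V = V + c1 * r1 * (B - P) + c2 * r2 * (g - P)  # Eq. (25)
        P = np.clip(P + V, lo, hi)                     # Eq. (26), kept inside the bounds
        fit = np.array([fitness(*p) for p in P])
        better = fit < B_fit                           # Eq. (28)
        B[better], B_fit[better] = P[better], fit[better]
        g = B[np.argmin(B_fit)].copy()                 # Eq. (29)
    L_opt, C_opt = g
    return int(round(L_opt)), C_opt
```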

Remark A

Regarding the main computational complexity, we denote the node number of one layer by L and the particle range by M. Considering the iteration number, the proposed method requires \(O\left( {N_k L_E^3 + N_k^{2}nM} \right) \) multiplications, covering the machine learning process and the model update process. Because the PSO method is used to search for the optimum parameters, the computational complexity of the proposed algorithm is obviously greater than that of ELM. However, it is worth noting that the proposed algorithm doesn't rely on trial and error, and a long-term automatic prediction model is achieved.

4 Experimental Results and Analysis

This section describes the experiments. First, we list the datasets and evaluation criteria. Then, VB-ELM is compared with other online regression methods. The parameter analysis is given in the last part.

4.1 Datasets Description

The VB-ELM method proposed in this paper is mainly intended for long-term online prediction. To evaluate its effectiveness, four UCI datasets with different attributes are selected [34]. Table 1 presents the specifications of each dataset.

Table 1 Specifications of the datasets

The first dataset (Temperature) is collected from a sensor network in a house, and the output is the dining-room temperature. The second dataset (Household) measures electric power consumption, with the globally active power as the output value. The averaged CO concentration is the forecasting target of the third dataset (AirQuality). The fourth dataset (Energy) contains appliance energy data, and the second attribute, called energy use, is chosen as the predicted label.

After analyzing the spatial-temporal distribution, the input vector is composed of two parts: the reference values over the last week and some related factors. All the time series are normalized as

$$\begin{aligned} s' = \frac{{s - s_{\min } }}{{s_{\max } - s_{\min } }} \end{aligned}$$
(31)

where s is an input numeric value, and \(s_{\min }\) and \(s_{\max }\) are the minimum and maximum of the selected feature data, respectively.
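A small min-max scaler implementing Eq. (31), applied column-wise to a feature matrix (an assumption), is as follows.

```python
import numpy as np

def min_max_normalize(S):
    """Scale each feature column to [0, 1] as in Eq. (31)."""
    s_min, s_max = S.min(axis=0), S.max(axis=0)
    return (S - s_min) / (s_max - s_min)
```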

4.2 Evaluation Criteria

Multiple metrics are required to evaluate the algorithm performance. In this paper, the generalization performance is judged by the Mean Absolute Percentage Error (MAPE), the Root Mean Square Error (RMSE), and the Pearson Correlation Coefficient (PCC), defined by the following three equations.

$$\begin{aligned} { MAPE}= & {} \frac{1}{N}\sum \limits _{i = 1}^N {\left| {\frac{{y_i - {{\hat{y}}}_i }}{{y_i }}} \right| }, \end{aligned}$$
(32)
$$\begin{aligned} { RMSE}= & {} \sqrt{\frac{1}{N}\sum \limits _{i = 1}^N {\left( {y_i - {{\hat{y}}}_i } \right) ^2 } }, \end{aligned}$$
(33)
$$\begin{aligned} { PCC}= & {} \frac{{\sum _{i = 1}^N {\left( {y_i - {\bar{y}}_i } \right) \left( {{{\hat{y}}}_i - \overline{{{\hat{y}}}} _i } \right) } }}{\left( \sqrt{\sum _{i = 1}^N \left( {y_i - {\bar{y}}_i } \right) ^2}\right) \left( \sqrt{\sum _{i = 1}^N \left( {{{\hat{y}}}_i - \overline{{\hat{y}}} _i } \right) ^2}\right) }, \end{aligned}$$
(34)

where \(y_i\) is the ith actual value in the current chunk, \(i = 1, \ldots ,N\), \({\hat{y}}_i\) is the ith predicted value, and \({\bar{y}}_i\) and \(\overline{{\hat{y}}} _i\) are the means of \(y_i\) and \({\hat{y}}_i\), respectively.
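The three criteria of Eqs. (32)-(34) can be computed directly with NumPy, as in the sketch below; the function names are illustrative.

```python
import numpy as np

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))                     # Eq. (32)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))                   # Eq. (33)

def pcc(y, y_hat):
    yc, hc = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(yc * hc) / (np.sqrt(np.sum(yc ** 2)) * np.sqrt(np.sum(hc ** 2)))  # Eq. (34)
```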

4.3 Experimental Results

To verify the effectiveness of the proposed algorithm, online SVM, OS-ELM, D-ELM, and PSO-EM-ELM are chosen as comparison methods. The reported test values are obtained from 50 independent Monte Carlo trials. The first experiment presents the prediction result for Dataset 1; Fig. 2 shows the evaluation of one execution round on the Temperature dataset.

Fig. 2 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for Temperature

The forecast values for the last 4 days are shown in Fig. 2a. The black line with dots denotes the true value, and the blue line denotes the predicted value of the proposed VB-ELM; the two curves match closely. In Fig. 2b and c, the proposed approach has smaller MAPE and RMSE, and thus a smaller generalization error. In particular, in Fig. 2c the RMSE is close to zero, far smaller than that of the other comparison methods. In Fig. 2d, the PCC of the proposed method is always above 0.95, which reflects a higher correlation and a better nonlinear mapping ability. These results indicate that, owing to its fast model adjustment, the new method is more suitable for long-term online prediction.

In the second experiment, the prediction result of Dataset 2 is shown. Figure 3 shows the evaluation values of one execution round on the Household dataset.

Fig. 3 Forecast value (a), RMSE (b), MAPE (c), and PCC (d) of the five methods for Household

As seen in Fig. 3a, although the data change sharply, the proposed method doesn't lose efficacy. In Fig. 3b and c, it is obvious that the proposed method has lower RMSE and MAPE than the other state-of-the-art methods; the RMSE stays under 0.2, less than half that of the other methods. In Fig. 3d, the PCC of the proposed method is close to one. These results show that the proposed model has good fitting ability and sound adjustment, because it can match the change of the nonlinear features and effectively avoid overfitting.

In the third experiment, the prediction result of Dataset 3 is shown. Figure 4 shows the evaluation values of one execution round on the AirQuality dataset.

Fig. 4 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for AirQuality

As seen in Fig. 4a, the proposed method remains efficient even though the sequential periodicity is blurry. In Fig. 4b and c, all generalization errors tend downward as time goes on, while the proposed method always keeps the smallest generalization error, because the model can handle serious overfitting. In Fig. 4d, the proposed method shows the best PCC, which proves that it has a better capacity for adjustment; this advantage helps generate a more suitable model from the arriving data.

In the last experiment, the prediction result of the Energy dataset is shown. Figure 5 demonstrates the forecast and evaluation values.

Fig. 5 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for Energy

As seen in Fig. 5a, although adjacent data change considerably, the proposed method still produces accurate forecasts. In Fig. 5b, although the MAPE of VB-ELM rises during the online learning process, its slope is smaller than that of PSO-EM-ELM and the generalization performance remains the best. In Fig. 5c, the RMSE increases first and then decreases, because the test data are discrete at times. In Fig. 5d, the proposed method has a better PCC than the comparison methods, which indicates that the new dynamic ELM does not lead to excessive learning. Although VB-ELM changes the model structure, it keeps the parameters near the optimum, so the generalization performance does not fluctuate violently.

Compared with the four contrast methods, the proposed method yields satisfactory results. The test accuracy improves significantly over online SVM and OS-ELM, and the generalization performance is in general better than that of D-ELM and PSO-EM-ELM. The error trend of the proposed method is stable, which proves that the method matches the data changes well for long-term online prediction.

4.4 Parameter Analysis

To explain the validity of VB-ELM, representative parameters are analyzed in this part. As an example, the parameter analysis for the second chunk of the Household dataset is shown in Fig. 6.

Fig. 6 Analysis of representative parameters: a test error with C and L, b MAPE

First, it is instructive to analyze the regularization parameter C and the hidden node number L, which are selected at the same time. As shown in Fig. 6a, when overfitting occurs, the result forces both C and L to become smaller. In this case, based on the PSO method, C remains unchanged and L decreases from 28 to 25. Hence, if the current chunk and the last chunk are not alike, the new parameters still support the online prediction well. As shown in Fig. 6b, with the growth of days, the difference between the variance and the squared bias tends to zero, because the automatic optimization guarantees the balance of variance and bias. This reflects the foresight of the model when fitting deviation happens.

In the above experiments, VB-ELM exhibits lower generalization error and higher correlation. Guided by the reliable fitting degree, the model can cope with changing overfitting and underfitting. Moreover, by considering the decomposition of the generalization error, the adaptive parameters are beneficial for nonlinear time series. This helps VB-ELM minimize user intervention, so long-term online prediction can be realized fully automatically.

5 Conclusion

In this paper, an efficient VB-ELM is proposed for long-term online prediction. Unlike traditional online models, the proposed method determines the model according to the changing test error. The update strategy not only balances variance and bias but also obtains jointly optimal parameters by PSO, so the proposed method has better generalization performance. The fitting degree helps judge overfitting and underfitting quantitatively, and the network re-establishment process based on user intervention is avoided. This makes the forecasting model suitable for non-expert users, and the long-term online prediction process can be controlled automatically. Because PSO is used to determine the hyper-parameters, the proposed method inevitably costs more time than the original ELM; with the development of hardware resources, the computational time in practical applications is becoming shorter. In future work, we will investigate how to predict unstructured data.