1 Introduction

Long-term online prediction is an essential requirement in practical applications such as power generation [1], traffic flow [2], air quality sensing [3], and temperature prediction [4]. Modeling such complex systems requires a large number of parameters to be determined, and the interactive learning process usually incurs an intolerable computational overhead. In a long-term online learning process, the neural network cannot always be rebuilt through trial and error driven by user intervention. Therefore, how to adjust the model and its parameters according to the arriving data has become a hot research topic [5].

For regression problems, the key objective is fitting. A single training model that is best for all datasets does not exist, and fitting deviation cannot be avoided [6]. Compared with underfitting, overfitting is harder to handle, because its degree can only be evaluated at the test stage [7]. A penalty constraint can be added to the candidate model to prevent overfitting [8]. Owing to its asymptotic convergence property, AIC is frequently used for small samples [9]. In the online scenario, data arrive one by one or chunk by chunk. The regularity of the input characteristics may change over time, and gradient-descent-based learning degrades the predictive efficiency. As a single-hidden-layer feedforward network (SLFN), the Extreme Learning Machine (ELM) needs no iterations and is widely used for classification and regression problems [10]. A dynamic network architecture extends the original ELM learning strategy to make it suitable for online sequential learning (OS-ELM) [11]. Several ELM variants increase the number of hidden nodes as data arrive; they offer not only low computational overhead but also remarkable nonlinear mapping capability [12]. To avoid enlarging the network without restriction, the incremental ELM (I-ELM) sets a termination condition based on the expected learning accuracy and a maximum hidden node number [13]. The convex I-ELM (CI-ELM) [14] achieves a faster convergence rate by adjusting the output weights, and the error-minimized ELM (EM-ELM) [15] approaches the prediction target by adding hidden nodes in chunks of varying length. To make the networks more compact, the adaptive ELM [16] and the sparse ELM [17] have been proposed. These pruning methods generate succinct links between the nodes of different layers, but they inevitably bear a high computational complexity due to the estimation of probability densities [18].

So far, the most representative dynamic ELM for long-term online prediction is D-ELM [19], which combines EM-ELM and AG-ELM [20]. The method has a strong advantage in data fitting when the target function is not explicit. To optimize multiple parameters, some researchers introduce biologically inspired optimization methods to acquire the best network topologies [21,22,23], but these batch learning processes are not suitable for online data streams, especially in long-term prediction. Han et al. [24] present a dynamic ELM based on particle swarm optimization (PSO) [25] to obtain appropriate network parameters, in which the training error and the condition performance are considered simultaneously. Although the above online forecasting methods can find locally optimal models, they all depend on predefined settings, including the maximum hidden node number and the ultimate number of iteration steps. If overfitting and underfitting cannot be eliminated in time, the generalization performance goes from bad to worse, and the long-term prediction model eventually fails.

In this paper, a novel dynamic ELM with balanced variance and bias (VB-ELM) is presented. Extending the traditional OS-ELM, the proposed method acquires beneficial parameters quantitatively. Relying on the degree of overfitting or underfitting, the VB-ELM structure is optimized automatically. To further reduce the approximation error, spatial-temporal features are considered in the update process. Since both model selection and constrained optimization affect the fitting degree, the proposed method deals with the hidden node number and the regularization parameter simultaneously. The whole forecasting process runs without manual trial-and-error operations, which is of significant application value for long-term online prediction.

The remainder of this paper is organized as follows. In Sect. 2, we describe the preliminaries involving EM-ELM and approximation error theory. In Sect. 3, a new ELM using balanced variance and bias is introduced. Experimental results and discussion are presented in Sect. 4. Finally, the conclusion is given in the last section.

2 Preliminaries

ELM has low computational complexity and allows easy hardware implementation [26], which has motivated researchers to develop improved ELMs for special applications, especially big data processing and high-dimensional prediction [27]. In this section, the preliminaries of EM-ELM and the relationship between the approximation error and the hidden node number are described.

2.1 Brief on EM-ELM

EM-ELM is a modified ELM that adds hidden nodes in chunks of different sizes. As the network grows, the output weights are updated so as to minimize the error [28]. Suppose that the sample data are \(\{ (\mathbf{{x}}_i ,y_i )\} _{i = 1}^N\); the basic mapping relation can then be expressed as

$$\begin{aligned} y_i = \sum \limits _{p = 1}^L {\beta _p \phi (\mathbf{{a}}_p \cdot \mathbf{{x}}_i + b_p )},\quad i = 1, \ldots ,N \end{aligned}$$
(1)

where \(\phi (\cdot )\) is the activation function, L is the number of hidden nodes, and \(\mathbf{{a}}_p\) and \(b_p\) are the input weight vector and the bias of the pth hidden node, respectively. The only remaining task in determining the network is to compute the output weights \({\varvec{\beta }}\). Therefore, Eq. (1) can be rewritten more concisely as

$$\begin{aligned} \mathbf{{Y}} = \mathbf{{H}}{\varvec{\beta }} \end{aligned}$$
(2)

where \(\mathbf{{H}} = \left[ {\begin{array}{lll} {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_1 + b_1 )} &{} \ldots &{} {\phi (\mathbf{{a}}_L \cdot \mathbf{{x}}_1 + b_L )} \\ \vdots &{} \ldots &{} \vdots \\ {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_N + b_1 )} &{} \ldots &{} {\phi (\mathbf{{a}}_L \cdot \mathbf{{x}}_N + b_L )} \\ \end{array}} \right] _{N \times L}\), \({\varvec{\beta }} = [\beta _1 ,\beta _2 ,\ldots ,\beta _L ]^T\) and \(\mathbf{{Y}} = \left[ {y_1 ,y_2 ,\ldots ,y_N } \right] ^T\).

Here \(\mathbf{{H}}\) denotes the hidden-layer output matrix of the network. By minimizing the approximation error, the output weights can be determined as follows

$$\begin{aligned} {\varvec{\beta }} = \mathbf{{H}}^\dag \mathbf{{Y}} \end{aligned}$$
(3)

where \(\mathbf{{H}}^{\dag }\) is the Moore–Penrose generalized inverse of \(\mathbf{{H}}\), which is

$$\begin{aligned} \mathbf{{H}}^{\dag } = (\mathbf{{H}}^{\mathrm{T}} \mathbf{{H}})^{ - 1} \mathbf{{H}}^T, \end{aligned}$$
(4)
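For concreteness, a minimal Python/NumPy sketch of the basic ELM training step in Eqs. (1)-(4) is given below; the sigmoid activation, the random initialization scheme, and the function names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def elm_fit(X, Y, L, seed=0):
    """Train a basic ELM: random hidden layer, output weights via Eq. (3)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.standard_normal((d, L))           # input weight vectors a_p
    b = rng.standard_normal(L)                # hidden biases b_p
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # hidden output matrix, Eq. (1) (sigmoid assumed)
    beta = np.linalg.pinv(H) @ Y              # beta = H^dagger Y, Eqs. (3)-(4)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                           # Y = H beta, Eq. (2)
```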

For EM-ELM, the target error \(\varepsilon \) and the initial hidden node number \(L_0\) are predefined. First, the output weight matrix is calculated by \({\varvec{\beta }}_1 = \mathbf{{H}}_{1}^{\dag } \mathbf{{Y}}\). If the approximation error is greater than \(\varepsilon \), new hidden nodes are added. Let \(\delta \mathbf{{H}}_1\) denote the hidden-layer output matrix of the newly added nodes; the generalized inverse is then calculated as

$$\begin{aligned} \mathbf{{H}}_2^{\dag } = \left( \mathbf{{H}}_2^{\mathrm{T}} \mathbf{{H}}_2\right) ^{ - 1} \mathbf{{H}}_2^{\mathrm{T}} = \left( {\left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] ~\left[ {\mathbf{{H}}_1 ,\delta \mathbf{{H}}_1 } \right] } \right) ^{ - 1} \left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] , \end{aligned}$$
(5)

By substituting the matrix \( \mathbf{{A}} = \left[ {\begin{array}{cc} {\mathbf{{A}}_{11} } &{}\quad {\mathbf{{A}}_{12} } \\ {\mathbf{{A}}_{21} } &{}\quad {\mathbf{{A}}_{22} } \\ \end{array}} \right] = \left( {\left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}}\right] ~\left[ {\mathbf{{H}}_1 ,\delta \mathbf{{H}}_1 } \right] } \right) ^{ - 1}\), \(\mathbf{{H}}_2^\dag \) can be expressed as the following block matrix

$$\begin{aligned} \mathbf{{H}}_2^{\dag } = \left[ {\begin{array}{cc} {\mathbf{{A}}_{11} } &{}\quad {\mathbf{{A}}_{12} } \\ {\mathbf{{A}}_{21} } &{}\quad {\mathbf{{A}}_{22} } \\ \end{array}} \right] \left[ {\begin{array}{c} {\mathbf{{H}}_1^{\mathrm{T}} } \\ {\delta \mathbf{{H}}_1^{\mathrm{T}} } \\ \end{array}} \right] = \left[ {\begin{array}{c} {\mathbf{{U}}_1 } \\ {\mathbf{{D}}_1 } \\ \end{array}} \right] , \end{aligned}$$
(6)

Based on the Schur complement, the two blocks of \(\mathbf{{H}}_2^{\dag }\) can be written as

$$\begin{aligned} \mathbf{{D}}_1= & {} \left( {\left( {\mathbf{{I}} - \mathbf{{H}}_1 \mathbf{{H}}_1^{\dag } } \right) \delta \mathbf{{H}}_1 } \right) ^{\dag }, \end{aligned}$$
(7)
$$\begin{aligned} \mathbf{{U}}_1= & {} \mathbf{{H}}_1^\dag \left( {\mathbf{{I}} - \delta \mathbf{{H}}_1 \mathbf{{D}}_1 } \right) . \end{aligned}$$
(8)

The output weight can be calculated as

$$\begin{aligned} {\varvec{\beta }}_2 = \mathbf{{H}}_2^{\dag } \mathbf{{Y}} = \left[ {\begin{array}{c} {\mathbf{{U}}_1 } \\ {\mathbf{{D}}_1 } \\ \end{array}} \right] \mathbf{{Y}}. \end{aligned}$$
(9)

Similarly, the \((k+1)\)th output weights are updated as follows

$$\begin{aligned} {\varvec{\beta }}_{k + 1} = \mathbf{{H}}_{k + 1}^\dag \mathbf{{Y}} = \left[ {\begin{array}{c} {\mathbf{{U}}_k } \\ {\mathbf{{D}}_k } \\ \end{array}} \right] \mathbf{{Y}}, \end{aligned}$$
(10)

where \(\mathbf{{D}}_k = \left( {\left( {\mathbf{{I}} - \mathbf{{H}}_k \mathbf{{H}}_k^\dag } \right) \delta \mathbf{{H}}_k } \right) ^\dag \) and \(\mathbf{{U}}_k = \mathbf{{H}}_k^\dag \left( {\mathbf{{I}} - \delta \mathbf{{H}}_k \mathbf{{D}}_k } \right) \). The activation function can take various forms, such as the sigmoid function and the Gaussian function, all of which make ELM well adapted to different tasks.
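A sketch of one EM-ELM growth step is given below; it updates the generalized inverse and the output weights by the block formulas of Eqs. (7)-(10) instead of recomputing the full pseudoinverse. The function signature and variable names are our own assumptions.

```python
import numpy as np

def em_elm_grow(H_k, H_pinv_k, delta_H, Y):
    """One EM-ELM growth step: update H^dagger and beta by Eqs. (7)-(10)."""
    N = H_k.shape[0]
    I = np.eye(N)
    D = np.linalg.pinv((I - H_k @ H_pinv_k) @ delta_H)   # D_k, Eq. (7)
    U = H_pinv_k @ (I - delta_H @ D)                     # U_k, Eq. (8)
    H_new = np.hstack([H_k, delta_H])                    # grown hidden output matrix
    H_pinv_new = np.vstack([U, D])                       # H_{k+1}^dagger, Eq. (6)
    beta_new = H_pinv_new @ Y                            # beta_{k+1}, Eq. (10)
    return H_new, H_pinv_new, beta_new
```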

2.2 Approximation Error Theory

ELM computes the output weights analytically. As the number of hidden nodes grows, the training error keeps decreasing. Under the original ELM framework, the learning target is to minimize the norm of the approximation error as follows

$$\begin{aligned} { error} = {\mathop {\min }\limits _{\varvec{\beta }}} \left\| \mathbf{{H}}{\varvec{\beta }} - \mathbf{{Y}} \right\| , \end{aligned}$$
(11)

Because \({\varvec{\beta }}^*\) is calculated by the Moore–Penrose generalized inverse, it is the minimum-norm least-squares solution, namely \(\left\| {{\varvec{\beta }}^*} \right\| _n \le \left\| {\varvec{\beta }} \right\| _n\) for every least-squares solution \({\varvec{\beta }}\) [29]. The set of least-squares solutions can be written as follows

$$\begin{aligned} \forall {\varvec{\beta }} \in \left\{ {\varvec{\beta }}:\left\| \mathbf{{H}}{\varvec{\beta }} - \mathbf{{Y}} \right\| _m \le \left\| {\mathbf{{Hz}} - \mathbf{{Y}}} \right\| _m ,\forall \mathbf{{Y}} \in R^m,\mathbf{{z}} \in R^n \right\} , \end{aligned}$$
(12)

where \(\left\| \cdot \right\| _n\) and \(\left\| \cdot \right\| _m\) are the norms on \(R^n\) and \(R^m\), respectively, and \(\left\{ \cdot \right\} \) is the set of least-squares solutions of \(\mathbf{{H}}{\varvec{\beta }} =\mathbf{{Y}}\). When the hidden node number grows from \(L_1\) to \(L_2\), the new error is smaller than or equal to that of the previous ELM, as shown below.

$$\begin{aligned} { error}_2= & {} \left\| {\mathbf{{H}}_{L_2 } {\varvec{\beta }}_{L_2 } - \mathbf{{Y}}} \right\| = \left\| {\left[ {\mathbf{{H}}_{L_1 } ,\mathbf{{H}}_{L_2 - L_1 } } \right] {\varvec{\beta }}_{L_2 } - \mathbf{{Y}}} \right\| \nonumber \\\le & {} \left\| {\left[ {\mathbf{{H}}_{L_1 } ,\mathbf{{H}}_{L_2 - L_1 } } \right] \left[ {\begin{array}{c} {\varvec{\beta }}_{L_1 } \\ 0 \\ \end{array}} \right] - {\mathbf{Y }}} \right\| = \left\| \mathbf{{H}}_{L_1} {\varvec{\beta }}_{L_1 } - \mathbf{{Y}} \right\| = { error}_1. \end{aligned}$$
(13)

Because \({\varvec{\beta }}_{L_2}\) is the Moore–Penrose solution, it has the minimum norm, whereas \(\left[ {{\varvec{\beta }}_{L_1 } ,0} \right] ^{\mathrm{T}}\) may not be the minimum-norm least-squares solution; hence the inequality in Eq. (13) holds. As the hidden node number grows, the ever-decreasing approximation error eventually leads to overfitting, so the learning target must be reconsidered.
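The non-increasing property of Eq. (13) can be checked numerically with the toy example below; the data and the hidden-node pool are assumptions. Drawing the first L columns of a fixed node pool guarantees that the smaller hidden-layer matrix is a sub-block of the larger one, as required by the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, L_max = 200, 3, 40
X = rng.uniform(-1, 1, (N, d))
Y = np.sin(X).sum(axis=1) + 0.05 * rng.standard_normal(N)   # toy target (assumption)

# Fixed pool of hidden nodes; using the first L columns makes H_{L1} a sub-block of H_{L2}.
A = rng.standard_normal((d, L_max))
b = rng.standard_normal(L_max)
H_full = 1.0 / (1.0 + np.exp(-(X @ A + b)))

for L in (5, 10, 20, 40):
    H = H_full[:, :L]
    beta = np.linalg.pinv(H) @ Y
    print(L, np.linalg.norm(H @ beta - Y))   # training error is non-increasing in L, as in Eq. (13)
```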

3 The Proposed Method

In this section, we introduce the proposed VB-ELM in detail. The algorithm includes two phases: initialization learning and model update. In the initialization learning phase, only a small portion of the training data is used. In the model update phase, the hidden node number and the regularization parameter are adjusted according to the fitting degree, and the model parameters are confirmed according to the arriving data.

3.1 Initialization Learning Phase

First of all, it is necessary to obtain the original model. Considering the initial \(t_0\) sample chunks \({ Set}_0 = \{(\mathbf{{x}}_i,y_i )\} _{i = 1}^{t_0 \times M}\), \(t_0\)-fold cross validation is employed in the training process. The whole training set is divided into \(t_{0}\) parts; each part is selected as the test dataset in turn, and the remaining \(t_{0}-1\) parts form the corresponding training set. \(C_0\) is the regularization parameter, which is confirmed according to the error of the \(t_0\)-fold cross validation. Mathematically, for the first fold, the fitting problem becomes the following optimization process

$$\begin{aligned} \begin{array}{l} { Minimize}:{ error}_0 = \frac{1}{2}\left\| {{\varvec{\beta }}_0 } \right\| + C_0 \frac{1}{2}\sum \limits _{i = 1}^{{\left( {t_0 - 1} \right) } \times M} {\left\| {\xi _i } \right\| ^2 } \\ {Subject\,to}:~\mathbf{{h}}\left( {\mathbf{{x}}_i } \right) {\varvec{\beta }}_0 = y_i^{\mathrm{T}} - \xi _i^{\mathrm{T}} ,i = 1, \ldots ,\left( {t_0 - 1} \right) \times M\\ \end{array}, \end{aligned}$$
(14)

where M is the length of each sample chunk, \({\varvec{\beta }}_0\) is the output weight vector, and \(\mathbf{{h}}\left( {\mathbf{{x}}_i } \right) \) is the hidden-layer mapping vector. Moreover, \(y_i\) is the expected output and \(\xi _i\) is the prediction error. According to the KKT theorem [30], the solution is given as follows

$$\begin{aligned} {\varvec{\beta }}_0 = \mathbf{{H}}_0^\dag \mathbf{{Y}}_0 = \left\{ {\begin{array}{ll} \mathbf{{H}}_0^{\mathrm{T}} \left( {\frac{\mathbf{{I}}}{{C_0 }} + \mathbf{{H}}_0 \mathbf{{H}}_0^{\mathrm{T}} } \right) ^{ - 1} \mathbf{{Y}}_0,\quad { when}~{{\left( {t_0 - 1} \right) } \times M} < L_0 \\ \left( {\frac{\mathbf{{I}}}{{C_0 }} + \mathbf{{H}}_0^{\mathrm{T}} \mathbf{{H}}_0 } \right) ^{ - 1} \mathbf{{H}}_0^{\mathrm{T}} \mathbf{{Y}}_0 ,\quad when\;{{\left( {t_0 - 1} \right) } \times M} \ge L_0 \\ \end{array}} \right. , \end{aligned}$$
(15)

where \(L_0\) is the hidden node number at initial learning step, and

$$\begin{aligned} \mathbf{{H}}_0 = \left[ {\begin{array}{ccc} {\phi (\mathbf{{a}}_1\cdot \mathbf{{x}}_1 + b_1 )} &{}\quad \ldots &{}\quad {\phi (\mathbf{{a}}_{L_0 } \cdot \mathbf{{x}}_1 + b_{L_0 } )} \\ \vdots &{}\quad \ldots &{}\quad \vdots \\ {\phi (\mathbf{{a}}_1 \cdot \mathbf{{x}}_{{\left( {t_0 - 1} \right) } \times M} + b_1 )} &{}\quad \ldots &{}\quad {\phi (\mathbf{{a}}_{L_0 } \cdot \mathbf{{x}}_{{\left( {t_0 - 1} \right) } \times M} + b_{L_0 } )} \\ \end{array}} \right] _{({{\left( {t_0 - 1} \right) } \times M}) \times L_0 }. \end{aligned}$$
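A minimal sketch of the regularized solution in Eq. (15) is given below; it simply switches between the two branches according to whether the number of training samples exceeds the hidden node number. The function name is illustrative.

```python
import numpy as np

def regularized_beta(H, Y, C):
    """Output weights of Eq. (15): invert the smaller Gram matrix."""
    n_samples, L = H.shape
    if n_samples < L:
        # beta = H^T (I/C + H H^T)^{-1} Y
        return H.T @ np.linalg.solve(np.eye(n_samples) / C + H @ H.T, Y)
    # beta = (I/C + H^T H)^{-1} H^T Y
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)
```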

Comparing the errors of the \(t_{0}\)-fold cross validation, the \({\bar{\varvec{\beta }}}_0\) with the minimum error is chosen as the forecast model, and the average validation error serves as a tolerance threshold. Considering the test sample chunk \(\mathbf{{X}}_1\), the generalization error can be decomposed as [31]

$$\begin{aligned} { error}_1 = { Bias}_1^2 (\mathbf{{X}}_1 ) + { Var}_1 (\mathbf{{X}}_1 ) + \varepsilon ^2, \end{aligned}$$
(16)

where \({ Bias}_1^2 (\mathbf{{X}}_1 )\) is the squared prediction bias, \({ Var}_1 (\mathbf{{X}}_1 )\) is the variance, and \(\varepsilon \) represents the noise. Usually \(\varepsilon \) is Gaussian white noise, so its mean can be regarded as zero. After the test processing, the bias and variance in the initialization learning phase are given by

$$\begin{aligned} { Bias}_1 (\mathbf{{X}}_1 )= & {} \mathbf{{H}}_1 {\bar{\varvec{\beta }}}_0 - \mathbf{{Y}}_1, \end{aligned}$$
(17)
$$\begin{aligned} { Var}_1 (\mathbf{{X}}_1 )= & {} \frac{1}{{t_1 }}\sum _{v=1}^{t_1} \left( \mathbf{{H}}_1^v {\varvec{\beta }}_1^v - \mathbf{{H}}_1 {\bar{\varvec{\beta }}}_0 \right) ^2 , \end{aligned}$$
(18)

where \(\mathbf{{H}}_1^v {\varvec{\beta }}_1^v\) is the predicted value of the vth fold. When the (\(t_0+1\))th sample chunk arrives, the training data are reorganized into \(t_1\) folds.
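As an illustration, the bias and variance terms of Eqs. (17)-(18) can be estimated from the fold models as in the sketch below. It assumes, consistent with the cross-validation step above, that all fold models share the same random hidden nodes and differ only in their output weights; the function and argument names are our own.

```python
import numpy as np

def bias_variance(H_test, Y_test, beta_bar, beta_folds):
    """Estimate Bias (Eq. (17)) and Var (Eq. (18)) on a test chunk.

    beta_bar   : output weights of the selected (minimum-error) model
    beta_folds : list of output weights trained on each validation fold
    """
    mean_pred = H_test @ beta_bar                               # H_1 * beta_bar_0
    bias = mean_pred - Y_test                                   # Eq. (17)
    var = np.mean([(H_test @ b - mean_pred) ** 2                # Eq. (18), folds share hidden nodes
                   for b in beta_folds], axis=0)
    return np.linalg.norm(bias), np.linalg.norm(var)
```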

3.2 Model Update Phase

In the model update phase, the hyper-parameters are determined according to the prior chunk. The change in bias or variance determines the fitting deviation [8]. In general, bias and variance move in opposite directions, which is known as the bias-variance dilemma. The relationship among generalization error, bias, and variance is shown in Fig. 1 [32].

Fig. 1 The relationship of generalization error, bias and variance

From Fig. 1, we can see that bias is the dominant component of the generalization error when the fitting ability is insufficient. Overfitting or underfitting can then be judged as follows

$$\begin{aligned} { Var}_k - { Bias}_k^2 \mathop \gtrless \limits _{H_2}^{H_1}{} { Var}_{k - 1}-{ Bias}_{k - 1}^2, \end{aligned}$$
(19)

where \({ Bias}_k = \left\| {{ Bias}_k (\mathbf{{X}}_k )} \right\| \) and \({ Var}_k = \left\| {{ Var}_k (\mathbf{{X}}_k )} \right\| \) are the mean values over the kth test chunk, and \({ Bias}_{k - 1}\) and \({ Var}_{k - 1}\) are the mean bias and variance of the \((k-1)\)th test chunk, respectively. If event \(H_1\) holds, overfitting occurs; otherwise, if event \(H_2\) holds, underfitting exists. Following [33], the fitting degree p is defined as

$$\begin{aligned} p = \frac{{ Bias}_{k,\max }^v-{ Bias}_k}{{\hat{\gamma }}_{k}- { Bias}_k}, \end{aligned}$$
(20)

where \({ Bias}_{k,\max }^v\) is the maximum error over the validation folds, \({ Bias}_k\) is the test error on the kth chunk, and \({\hat{\gamma }}_{k}\) is the no-information error rate of the current phase. \({ Bias}_{k,\max }^v\) is calculated as follows

$$\begin{aligned} { Bias}_{k,\max }^v = {\mathop {\max }\limits _{t_k}} \left\| \frac{1}{t_k}\left( \mathbf{{H}}_k^v {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_k^v \right) \right\| . \end{aligned}$$
(21)

\({\hat{\gamma }}_{k}\) describes the no-information error rate on the known dataset, which is estimated as

$$\begin{aligned} {\hat{\gamma }} _k = \left\| {\frac{1}{{t_k^2 }}\sum \limits _{i = 1}^{t_k } {\sum \limits _{j = 1,j \ne i}^{t_k } {(\mathbf{{H}}_j {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_i)}}}\right\| . \end{aligned}$$
(22)

The range of p is obviously from zero to one. When the value is zero, the model and the new data fit each other well; when p approaches one, the model deviates severely from the new data. On the basis of p, the regularization parameter threshold for the (\(k+1\))th chunk is moved as follows

$$\begin{aligned} C_{k + 1}^{{ bound}} = \left\{ {\begin{array}{ll} \left\lfloor {C_k \cdot (1 - p)} \right\rfloor ,&{}\quad { overfitting} \\ \left\lceil {C_k \cdot (1 + p)} \right\rceil ,&{}\quad { underfitting} \\ \end{array}} \right. , \end{aligned}$$
(23)

where \(C_k\) is the current value, and \(\left\lfloor \cdot \right\rfloor \) and \(\left\lceil \cdot \right\rceil \) denote rounding down and up to the nearest integer, respectively. The threshold of the hidden node number is changed analogously:

$$\begin{aligned} L_{k + 1}^{{ bound}} = \left\{ {\begin{array}{ll} \left\lfloor {L_k \cdot (1 - p)} \right\rfloor ,&{}\quad { overfitting} \\ \left\lceil {L_k \cdot (1 + p)} \right\rceil ,&{}\quad { underfitting} \\ \end{array}} \right. , \end{aligned}$$
(24)

where \(L_k\) is the current threshold value. The joint optimal solution is then obtained by the PSO method as follows

$$\begin{aligned} V_{L_i ,C_j }^{n + 1}= & {} V_{L_i ,C_j }^n + c_1 \times r_1 \times \left[ {B_{L_i ,C_j }^n - P_{L_i ,C_j }^n } \right] + c_2 \times r_2 \times \left[ {B_{L_g ,C_j } - P_{L_i ,C_j }^n } \right] ,\end{aligned}$$
(25)
$$\begin{aligned} P_{L_i ,C_j }^{n + 1}= & {} P_{L_i ,C_j }^n + V_{L_i ,C_j }^{n + 1} ,\quad L_{\min } \le L_i \le L_{\max } ,C_{\min } \le C_j \le C_{\max }, \end{aligned}$$
(26)

where V and P express velocity and position, n is the iteration number, \(B_{L_i ,C_j }^n\) is the personal best particle at iteration n, and \(B_{L_g ,C_j }\) is the group-optimal location. \(L_{\min } = \min \left\{ {L_{k + 1}^{{ bound}} ,L_k } \right\} \), \(L_{\max } = \max \left\{ {L_{k + 1}^{{ bound}} ,L_k } \right\} \), \(C_{\min } = \min \left\{ {C_{k + 1}^{{ bound}} ,C_k } \right\} \), and \(C_{\max } = \max \left\{ {C_{k + 1}^{{ bound}} ,C_k } \right\} \). \(r_1\) and \(r_2\) are uniform random numbers in \(\left[ {0,1} \right] \), and \(c_1\) and \(c_2\) are positive constants. The fitness function is calculated as follows

$$\begin{aligned} { fitness} = \left\| {\mathbf{{H}}_k {\bar{\varvec{\beta }}}_{k-1} - \mathbf{{Y}}_k } \right\| . \end{aligned}$$
(27)

The best parameter group is updated as follows

$$\begin{aligned} B_{L_i ,C_j }^{n + 1}= & {} \left\{ \begin{array}{ll} P_{L_i ,C_j }^n,&{}\quad { if}\quad { fitness}\left( {P_{L_i ,C_j }^n } \right) < { fitness}\left( {B_{L_i ,C_j }^n } \right) \\ B_{L_i ,C_j }^n,&{}\quad { else} \\ \end{array} \right. ,\end{aligned}$$
(28)
$$\begin{aligned} B_{L_g ,C_j}= & {} \min \left\{ {B_{L_{\min } ,C_j }^{n + 1} ,B_{L_i ,C_j }^{n + 1} , \ldots ,B_{L_{\max } ,C_j }^{n + 1} } \right\} . \end{aligned}$$
(29)

After the optimal fitness is acquired, the corresponding particle coordinates are \(L_i ^\prime \) and \(C_i ^\prime \), so we set \(L_{k + 1} = L_i ^\prime \) and \(C_{k + 1} = C_i ^\prime \). Finally, the solution is described as follows

$$\begin{aligned} {\varvec{\beta }}_{k + 1} = \mathbf{{H}}_{k + 1}^\dag \mathbf{{Y}}_{k + 1} = \left\{ {\begin{array}{ll} \mathbf{{H}}_{k + 1}^{\mathrm{T}} \left( {\frac{\mathbf{{I}}}{{C_{k + 1} }} + \mathbf{{H}}_{k + 1} \mathbf{{H}}_{k + 1}^{\mathrm{T}} } \right) ^{ - 1} \mathbf{{Y}}_{k + 1} ,&{}\quad { when}~{\left( {t_{k + 1} \times M} \right) } < L_{k + 1} \\ \left( {\frac{\mathbf{{I}}}{{C_{k + 1} }} + \mathbf{{H}}_{k + 1}^{\mathrm{T}} \mathbf{{H}}_{k + 1} } \right) ^{ - 1} \mathbf{{H}}_{k + 1}^{\mathrm{T}} \mathbf{{Y}}_{k + 1} ,&{}\quad { when}~\left( {t_{k + 1} \times M} \right) \ge L_{k + 1} \\ \end{array}} \right. . \end{aligned}$$
(30)
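To make the update of the fitting degree and the parameter thresholds concrete, the following is a small sketch of Eqs. (19)-(24). The clipping of p to [0, 1] and all function names are our own assumptions.

```python
import math

def fitting_degree(bias_k, bias_k_max, gamma_hat):
    """Fitting degree p of Eq. (20), clipped to [0, 1] (clipping is an assumption)."""
    p = (bias_k_max - bias_k) / (gamma_hat - bias_k)
    return min(max(p, 0.0), 1.0)

def update_bounds(bias_k, var_k, bias_prev, var_prev, p, C_k, L_k):
    """Detect over/underfitting by Eq. (19) and move the thresholds by Eqs. (23)-(24)."""
    overfitting = (var_k - bias_k ** 2) > (var_prev - bias_prev ** 2)   # event H1, Eq. (19)
    if overfitting:
        C_bound = math.floor(C_k * (1 - p))                             # Eq. (23), overfitting
        L_bound = math.floor(L_k * (1 - p))                             # Eq. (24), overfitting
    else:                                                               # event H2: underfitting
        C_bound = math.ceil(C_k * (1 + p))
        L_bound = math.ceil(L_k * (1 + p))
    return overfitting, C_bound, L_bound
```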

The method alters the parameters according to the fitting degree, and the hidden nodes need not be tuned one by one, so VB-ELM is better suited to long-term online applications. Taking the characteristics of the learning procedure into account, the complete VB-ELM algorithm is presented in Algorithm 1.

Algorithm 1 The complete VB-ELM algorithm
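Since Algorithm 1 is presented as a figure, a minimal Python sketch of the joint (L, C) search of Eqs. (25)-(29) is given below as an illustration. The swarm size, iteration count, position clipping, and the fitness callback are assumptions; fitness(L, C) is expected to train a candidate model with int(round(L)) hidden nodes and return the norm of Eq. (27). The returned pair is then used as \(L_{k+1}\) and \(C_{k+1}\) in Eq. (30).

```python
import numpy as np

def pso_select(fitness, L_range, C_range, n_particles=20, n_iter=30,
               c1=2.0, c2=2.0, seed=0):
    """Jointly search (L, C) within the bounds of Eqs. (23)-(24) using the
    PSO updates of Eqs. (25)-(26) and best-particle updates of Eqs. (28)-(29)."""
    rng = np.random.default_rng(seed)
    lo = np.array([L_range[0], C_range[0]], dtype=float)
    hi = np.array([L_range[1], C_range[1]], dtype=float)
    P = rng.uniform(lo, hi, (n_particles, 2))          # particle positions (L, C)
    V = np.zeros_like(P)                               # particle velocities
    B = P.copy()                                       # personal bests
    B_fit = np.array([fitness(*p) for p in P])
    g = B[np.argmin(B_fit)].copy()                     # group best, Eq. (29)

    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        V = V + c1 * r1 * (B - P) + c2 * r2 * (g - P)  # Eq. (25)
        P = np.clip(P + V, lo, hi)                     # Eq. (26), kept inside the bounds
        fit = np.array([fitness(*p) for p in P])
        better = fit < B_fit                           # Eq. (28)
        B[better], B_fit[better] = P[better], fit[better]
        g = B[np.argmin(B_fit)].copy()                 # Eq. (29)
    L_opt, C_opt = g
    return int(round(L_opt)), C_opt
```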

Remark A

Regarding the main computational complexity, we denote the node number of one layer by L and the particle range by M. Considering the iteration number, the proposed method requires \(O\left( {N_k L_E^3 + N_k^{2}nM} \right) \) multiplications, covering the machine learning process and the model update process. Because the PSO method is used to search for the optimum parameters, the computational complexity of the proposed algorithm is obviously greater than that of ELM. However, it is worth noting that the proposed algorithm doesn't rely on trial and error, and a long-term automatic prediction model is achieved.

4 Experimental Results and Analysis

This section describes the experiments. First, we list the datasets and evaluation criteria. Then, VB-ELM is compared with other online regression methods. The parameter analysis is given in the last part.

4.1 Datasets Description

The VB-ELM method proposed in this paper is mainly intended for long-term online prediction. To evaluate its effectiveness, four UCI datasets with different attributes are selected [34]. Table 1 presents the specifications of each dataset.

Table 1 Specifications of the datasets

The first dataset (Temperature) is collected from a sensor network in a house, and the output is the dining-room temperature. The second dataset (Household) measures electric power consumption, with the globally active power as the output value. The averaged CO concentration is the forecasting target of the third dataset (AirQuality). The fourth dataset (Energy) contains appliance energy data, and the second attribute, called energy use, is chosen as the predicted label.

After analyzing the spatial-temporal distribution, the input vector is composed of two parts: the reference values over the last week and some related factors. All the time series are normalized as

$$\begin{aligned} s' = \frac{{s - s_{\min } }}{{s_{\max } - s_{\min } }} \end{aligned}$$
(31)

where s is an input numeric value, and \(s_{\min }\) and \(s_{\max }\) are the minimum and maximum of the selected feature data, respectively.
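A small min-max scaler implementing Eq. (31), applied column-wise to a feature matrix (an assumption), is as follows.

```python
import numpy as np

def min_max_normalize(S):
    """Scale each feature column to [0, 1] as in Eq. (31)."""
    s_min, s_max = S.min(axis=0), S.max(axis=0)
    return (S - s_min) / (s_max - s_min)
```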

4.2 Evaluation Criteria

Multiple metrics are required to evaluate the algorithm performance. In this paper, the generalization performance is judged by the Mean Absolute Percentage Error (MAPE), the Root Mean Square Error (RMSE), and the Pearson Correlation Coefficient (PCC), defined by the following three equations.

$$\begin{aligned} { MAPE}= & {} \frac{1}{N}\sum \limits _{i = 1}^N {\left| {\frac{{y_i - {{\hat{y}}}_i }}{{y_i }}} \right| }, \end{aligned}$$
(32)
$$\begin{aligned} { RMSE}= & {} \sqrt{\frac{1}{N}\sum \limits _{i = 1}^N {\left( {y_i - {{\hat{y}}}_i } \right) ^2 } }, \end{aligned}$$
(33)
$$\begin{aligned} { PCC}= & {} \frac{{\sum _{i = 1}^N {\left( {y_i - {\bar{y}}_i } \right) \left( {{{\hat{y}}}_i - \overline{{{\hat{y}}}} _i } \right) } }}{\left( \sqrt{\sum _{i = 1}^N \left( {y_i - {\bar{y}}_i } \right) ^2}\right) \left( \sqrt{\sum _{i = 1}^N \left( {{{\hat{y}}}_i - \overline{{\hat{y}}} _i } \right) ^2}\right) }, \end{aligned}$$
(34)

where \(y_i\) is the ith actual value in the current chunk, \(i = 1, \ldots ,N\), \({\hat{y}}_i\) is the ith predicted value, and \({\bar{y}}_i\) and \(\overline{{\hat{y}}} _i\) are the means of \(y_i\) and \({\hat{y}}_i\), respectively.
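The three criteria of Eqs. (32)-(34) can be computed directly with NumPy, as in the sketch below; the function names are illustrative.

```python
import numpy as np

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))                     # Eq. (32)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))                   # Eq. (33)

def pcc(y, y_hat):
    yc, hc = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(yc * hc) / (np.sqrt(np.sum(yc ** 2)) * np.sqrt(np.sum(hc ** 2)))  # Eq. (34)
```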

4.3 Experimental Results

To verify the effectiveness of the proposed algorithm, online SVM, OS-ELM, D-ELM, and PSO-EM-ELM are chosen as comparison methods. The reported test values are obtained from 50 independent Monte Carlo trials. The first experiment presents the prediction result for Dataset 1; Fig. 2 shows the evaluation of one execution round on the Temperature dataset.

Fig. 2 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for Temperature

The forecast values for the last 4 days are shown in Fig. 2a. The black line with dots denotes the true value, and the blue line denotes the predicted value of the proposed VB-ELM; the two curves match closely. In Fig. 2b and c, the proposed approach has smaller MAPE and RMSE, and thus a smaller generalization error. In particular, in Fig. 2c the RMSE is close to zero, far smaller than that of the other comparison methods. In Fig. 2d, the PCC of the proposed method is always above 0.95, which reflects a higher correlation and a better nonlinear mapping ability. These results indicate that, owing to its fast model adjustment, the new method is more suitable for long-term online prediction.

In the second experiment, the prediction result of Dataset 2 is shown. Figure 3 shows the evaluation values of one execution round on the Household dataset.

Fig. 3 Forecast value (a), RMSE (b), MAPE (c), and PCC (d) of the five methods for Household

As seen in Fig. 3a, although the data change sharply, the proposed method doesn't lose efficacy. In Fig. 3b and c, it is obvious that the proposed method has lower RMSE and MAPE than the other state-of-the-art methods; the RMSE stays under 0.2, less than half that of the other methods. In Fig. 3d, the PCC of the proposed method is close to one. These results show that the proposed model has good fitting ability and sound adjustment, because it can match the change of the nonlinear features and effectively avoid overfitting.

In the third experiment, the prediction result of Dataset 3 is shown. Figure 4 shows the evaluation values of one execution round on the AirQuality dataset.

Fig. 4 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for AirQuality

As seen in Fig. 4a, the proposed method remains efficient even though the sequential periodicity is blurry. In Fig. 4b and c, all generalization errors tend downward as time goes on, while the proposed method always keeps the smallest generalization error, because the model can handle serious overfitting. In Fig. 4d, the proposed method shows the best PCC, which proves that it has a better capacity for adjustment; this advantage helps generate a more suitable model from the arriving data.

In the last experiment, the prediction result of the Energy dataset is shown. Figure 5 demonstrates the forecast and evaluation values.

Fig. 5 Forecast value (a), MAPE (b), RMSE (c), and PCC (d) of the five methods for Energy

As seen in Fig. 5a, although adjacent data change considerably, the proposed method still produces accurate forecasts. In Fig. 5b, although the MAPE of VB-ELM rises during the online learning process, its slope is smaller than that of PSO-EM-ELM and the generalization performance remains the best. In Fig. 5c, the RMSE increases first and then decreases, because the test data are discrete at times. In Fig. 5d, the proposed method has a better PCC than the comparison methods, which indicates that the new dynamic ELM does not lead to excessive learning. Although VB-ELM changes the model structure, it keeps the parameters near the optimum, so the generalization performance does not fluctuate violently.

Compared with the four contrast methods, the proposed method yields satisfactory results. The test accuracy improves significantly over online SVM and OS-ELM, and the generalization performance is in general better than that of D-ELM and PSO-EM-ELM. The error trend of the proposed method is stable, which proves that the method matches the data changes well for long-term online prediction.

4.4 Parameter Analysis

To explain the validity of VB-ELM, representative parameters are analyzed in this part. As an example, the parameter analysis for the second chunk of the Household dataset is shown in Fig. 6.

Fig. 6 Analysis of representative parameters: a test error with C and L, b MAPE

First, it is instructive to analyze the regularization parameter C and the hidden node number L, which are selected at the same time. As shown in Fig. 6a, when overfitting occurs, the result forces both C and L to become smaller. In this case, based on the PSO method, C remains unchanged and L decreases from 28 to 25. Hence, if the current chunk and the last chunk are not alike, the new parameters still support the online prediction well. As shown in Fig. 6b, with the growth of days, the difference between the variance and the squared bias tends to zero, because the automatic optimization guarantees the balance of variance and bias. This reflects the foresight of the model when fitting deviation happens.

In the above experiments, VB-ELM exhibits lower generalization error and higher correlation. Guided by the reliable fitting degree, the model can cope with changing overfitting and underfitting. Moreover, by considering the decomposition of the generalization error, the adaptive parameters are beneficial for nonlinear time series. This helps VB-ELM minimize user intervention, so long-term online prediction can be realized fully automatically.

5 Conclusion

In this paper, an efficient VB-ELM is proposed for long-term online prediction. Unlike traditional online models, the proposed method determines the model according to the changing test error. The update strategy not only balances variance and bias but also obtains jointly optimal parameters by PSO, so the proposed method has better generalization performance. The fitting degree helps judge overfitting and underfitting quantitatively, and the network re-establishment process based on user intervention is avoided. This makes the forecasting model suitable for non-expert users, and the long-term online prediction process can be controlled automatically. Because PSO is used to determine the hyper-parameters, the proposed method inevitably costs more time than the original ELM; with the development of hardware resources, the computational time in practical applications is becoming shorter. In future work, we will investigate how to predict unstructured data.