Problems
- Decaying learning rates provide a good compromise between escaping poor local minima and convergence
- Many of the convergence issues arise because we force the same learning rate on all parameters
- Try relaxing the requirement that a fixed step size is used across all dimensions
- To be clear, backpropagation is a way to compute derivatives, not a training algorithm
- The algorithms below are NOT gradient descent
Better convergence strategy
Derivative-inspired algorithms
RProp
- Resilient propagation
- A simple first-order algorithm, applied independently to each component
- Steps in different directions are not coupled
- At each step (see the sketch after this list):
- If the derivative at the current location recommends continuing in the same direction as before (i.e. has not changed sign from earlier):
- Increase the step ($\alpha > 1$) and continue in the same direction
- If the derivative has changed sign (i.e. we’ve overshot a minimum)
- Reduce the step ($\beta < 1$) and reverse direction
- Features
- It is frequently much more efficient than gradient descent
- No convexity assumption
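A minimal sketch of the sign-based rule above; the function name, the $\alpha$/$\beta$ defaults, and the step-size bounds are illustrative assumptions, not values from the notes:

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, alpha=1.2, beta=0.5,
                 step_min=1e-6, step_max=50.0):
    """One RProp update; every component is treated independently."""
    same_sign = grad * prev_grad > 0  # derivative kept its direction
    flipped = grad * prev_grad < 0    # sign changed: we overshot a minimum

    # Grow the step where the direction is confirmed, shrink it where it flipped
    step = np.where(same_sign, step * alpha,
                    np.where(flipped, step * beta, step))
    step = np.clip(step, step_min, step_max)

    # Step against the sign of the current derivative; its magnitude is ignored
    return w - np.sign(grad) * step, step
```

Only the sign of the derivative is used, which is why no convexity assumption is needed.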
QuickProp
- Newton updates
$$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \eta \mathbf{A}^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$$
- Quickprop employs the Newton updates with two modifications
- It treats each dimension independently
$$w_{i}^{k+1} = w_{i}^{k} - E''\left(w_{i}^{k} \mid w_{j}^{k}, j \neq i\right)^{-1} E'\left(w_{i}^{k} \mid w_{j}^{k}, j \neq i\right)$$
- It approximates the second derivative through finite differences (see the sketch after this list)
$$w_{l,ij}^{(k+1)} = w_{l,ij}^{(k)} - \frac{\Delta w_{l,ij}^{(k-1)}}{\operatorname{Err}'\left(w_{l,ij}^{(k)}\right) - \operatorname{Err}'\left(w_{l,ij}^{(k-1)}\right)} \operatorname{Err}'\left(w_{l,ij}^{(k)}\right)$$
- Features
- Employs Newton updates with empirically derived derivatives
- Prone to some instability for non-convex objective functions
- But is still one of the fastest training algorithms for many problems
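A minimal sketch of the secant update above, one weight at a time; the function name and the `eps` guard against a zero denominator are assumptions for illustration:

```python
def quickprop_update(w, grad, prev_grad, prev_dw, eps=1e-12):
    """One QuickProp step: a per-weight Newton update whose second
    derivative is the finite difference (grad - prev_grad) / prev_dw."""
    # Secant form of the update in the notes:
    #   dw = -prev_dw / (Err'(w_k) - Err'(w_{k-1})) * Err'(w_k)
    dw = -prev_dw * grad / (grad - prev_grad + eps)
    return w + dw, dw
```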
Momentum methods
- Insight
- Need to keep track of oscillations
- In a direction that converges, the gradient keeps pointing the same way
- Emphasize steps in directions that converge smoothly
- In a direction that overshoots, the steps swing back and forth
- Shrink steps in directions that bounce around
- Maintain a running average of all past steps
- Steps get longer in directions where the gradient keeps the same sign
- Steps become shorter in directions where the sign keeps flipping
- Update with the running average, rather than the current gradient
- Methods that emphasize directions of steady improvement are demonstrably superior to other methods
- Momentum Update
$$\Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_{W} \operatorname{Loss}\left(W^{(k-1)}\right)^{T}$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
- First compute the gradient step at the current location: $-\eta \nabla_{W} \operatorname{Loss}(W^{(k-1)})^{T}$
- Then add in the scaled previous step: $\beta \Delta W^{(k-1)}$ (see the sketch below)
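A minimal sketch of the update above; the function name `momentum_step`, the `grad_fn` callback, and the default $\eta$, $\beta$ values are illustrative assumptions, not from the notes:

```python
def momentum_step(w, dw_prev, grad_fn, eta=0.1, beta=0.9):
    """Classical momentum: add the scaled previous step to the
    gradient step taken at the current location."""
    dw = beta * dw_prev - eta * grad_fn(w)  # running average of past steps
    return w + dw, dw
```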
Nesterov’s Accelerated Gradient
- Change the order of operations
$$\Delta W^{(k)} = \beta \Delta W^{(k-1)} - \eta \nabla_{W} \operatorname{Loss}\left(W^{(k-1)} + \beta \Delta W^{(k-1)}\right)^{T}$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
- First extend the previous step: $\beta \Delta W^{(k-1)}$
- Then compute the gradient step at the resultant position: $-\eta \nabla_{W} \operatorname{Loss}(W^{(k-1)} + \beta \Delta W^{(k-1)})^{T}$
- Add the two to obtain the final step
- Converges much faster than momentum
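A contrasting sketch under the same illustrative assumptions as `momentum_step` above; the only change is where the gradient is evaluated:

```python
def nesterov_step(w, dw_prev, grad_fn, eta=0.1, beta=0.9):
    """Nesterov: extend the previous step first, then take the
    gradient step at the look-ahead position."""
    lookahead = w + beta * dw_prev                  # first extend the previous step
    dw = beta * dw_prev - eta * grad_fn(lookahead)  # gradient at the resultant position
    return w + dw, dw
```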
Summary
- Using the same step size for all dimensions is bad
- Try to normalize curvature in all directions
- Second order methods, e.g. Newton’s method
- Too expensive: requires inversion of a giant Hessian
- Treat each dimension independently
- RProp / QuickProp
- Works, but ignores dependence between dimensions
- Can still be too slow
- Momentum methods, which emphasize directions of steady improvement, are demonstrably superior to other methods
Incremental updates
- Batch gradient descent
- Try to simultaneously adjust the function at all training points
- We must process all training points before making a single adjustment
- Stochastic gradient descent
- Adjust the function at one training point at a time
- A single pass through the entire training data is called an “epoch”
- An epoch over a training set with $T$ samples results in $T$ updates of the parameters
- We must go through them in random order to get more convergent behavior (see the sketch below)
- Otherwise we may get cyclic behavior that is hard to converge
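A minimal sketch of one SGD epoch with a randomized visiting order; the signature and the `grad_fn` callback are assumptions for illustration:

```python
import numpy as np

def sgd_epoch(w, X, y, grad_fn, eta, rng=None):
    """One epoch of SGD: T samples give T parameter updates."""
    if rng is None:
        rng = np.random.default_rng()
    for i in rng.permutation(len(X)):         # random order avoids cyclic behavior
        w = w - eta * grad_fn(w, X[i], y[i])  # adjust at a single training point
    return w
```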
- Learning rate
- Correcting the function for individual instances will lead to never-ending, non-convergent updates (correct one, and miss the other)
- The learning will continuously “chase” the latest sample
- Once the decaying learning rate becomes minuscule, corrections for individual instances will no longer modify the function
- Drawbacks
- Both batch and stochastic gradient descent give unbiased estimates of the expected loss
$$E[\operatorname{Loss}(f(X; W), g(X))] = E[\operatorname{div}(f(X; W), g(X))]$$
- But the variance of the empirical risk in batch gradient descent is $\frac{1}{N}$ times that of stochastic gradient descent
- Like using $\frac{1}{N}\sum_i X_i$ versus a single $X_i$ to estimate $\bar{X}$ (checked numerically below)
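A quick numerical check of the $\frac{1}{N}$ relation; the Gaussian data and the sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 100_000
samples = rng.normal(size=(trials, N))

var_single = samples[:, 0].var()       # one-sample estimate, as in SGD
var_mean = samples.mean(axis=1).var()  # N-sample average, as in batch GD
print(var_single / var_mean)           # close to N: batch variance is ~1/N of SGD's
```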
- Mini-batch gradient descent
- Adjust the function at a small, randomly chosen subset of points
- Also an unbiased estimate of the expected error, with lower variance than SGD
- The mini-batch size is a hyperparameter to be optimized (see the sketch below)
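A minimal sketch of one mini-batch epoch; the signature, `grad_fn`, and the batch handling are assumptions for illustration:

```python
import numpy as np

def minibatch_epoch(w, X, y, grad_fn, eta, batch_size, rng):
    """One epoch over randomly chosen mini-batches of the training set."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        w = w - eta * grad_fn(w, X[idx], y[idx])  # gradient averaged over the batch
    return w
```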
- Convergence depends on learning rate
- Simple technique: fix learning rate until the error plateaus, then reduce learning rate by a fixed factor (e.g. 10)
- Advanced methods: Adaptive updates, where the learning rate is itself determined as part of the estimation
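A minimal sketch of the simple plateau heuristic above; `patience` and `tol`, which define what counts as a plateau, are assumed knobs, not from the notes:

```python
def decay_on_plateau(eta, errors, factor=10.0, patience=3, tol=1e-4):
    """Divide the learning rate by `factor` once the error has improved
    by less than `tol` over the last `patience` epochs."""
    if len(errors) > patience and errors[-patience - 1] - errors[-1] < tol:
        return eta / factor
    return eta
```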