CMU 11-785 L06 Optimization

This post covers optimization strategies for neural network training, including momentum methods, RProp, QuickProp, and Nesterov's accelerated gradient. These methods speed up convergence and help avoid getting trapped in poor local minima by adapting learning rates, exploiting past gradient information, and treating each dimension independently.


Problems

  • Decaying learning rates provide a good compromise between escaping poor local minima and convergence
  • Many of the convergence issues arise because we force the same learning rate on all parameters
  • Try relaxing the requirement that a fixed step size be used across all dimensions
  • To be clear, backpropagation is a way to compute derivatives, not an optimization algorithm by itself
    • The derivative computation alone is NOT the gradient descent algorithm

Better convergence strategy

Derivative-inspired algorithms

RProp
  • Resilient propagation
  • Simple first-order algorithm, to be followed independently for each component
    • Steps in different directions are not coupled
  • At each step
    • If the derivative at the current location recommends continuing in the same direction as before (i.e. it has not changed sign from earlier):
      • Increase the step (scale it by $\alpha > 1$), and continue in the same direction
    • If the derivative has changed sign (i.e. we have overshot a minimum):
      • Reduce the step (scale it by $\beta < 1$) and reverse direction, as sketched after the feature list below


  • Features
    • It is frequently much more efficient than gradient descent
    • No convexity assumption
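As a concrete illustration, here is a minimal NumPy sketch of one RProp step over a parameter vector. The increase/decrease factors and the step bounds are illustrative defaults, not values given in the lecture.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               alpha=1.2, beta=0.5, step_min=1e-6, step_max=50.0):
    """One RProp update; every parameter keeps its own step size.

    alpha (>1), beta (<1) and the step bounds are illustrative defaults.
    """
    agree = grad * prev_grad                  # >0: same sign as before, <0: overshot
    step = np.where(agree > 0, np.minimum(step * alpha, step_max), step)
    step = np.where(agree < 0, np.maximum(step * beta, step_min), step)
    # Move each weight by its own step, opposite to the sign of its gradient;
    # if the gradient flipped sign, this automatically reverses direction.
    w = w - np.sign(grad) * step
    return w, grad, step                      # grad becomes prev_grad next call
```

Note that the update depends only on the sign of each component's gradient, not its magnitude, which is what lets every dimension be handled independently.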
QuickProp
  • Newton updates

$$\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta\, \mathbf{A}^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$$

  • QuickProp employs the Newton update with two modifications (a per-weight sketch follows the feature list below):

    • It treats each dimension independently

    • It approximates the second derivative empirically, from the difference between successive first derivatives

$$w_{i}^{k+1}=w_{i}^{k}-E^{\prime\prime}\left(w_{i}^{k} \mid w_{j}^{k}, j \neq i\right)^{-1} E^{\prime}\left(w_{i}^{k} \mid w_{j}^{k}, j \neq i\right)$$

$$w_{l,ij}^{(k+1)}=w_{l,ij}^{(k)}-\frac{\Delta w_{l,ij}^{(k-1)}}{\operatorname{Err}^{\prime}\left(w_{l,ij}^{(k)}\right)-\operatorname{Err}^{\prime}\left(w_{l,ij}^{(k-1)}\right)} \operatorname{Err}^{\prime}\left(w_{l,ij}^{(k)}\right)$$

  • Features
    • Employs Newton updates with empirically derived derivatives
    • Prone to some instability for non-convex objective functions
    • But is still one of the fastest training algorithms for many problems
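A rough element-wise sketch of the secant-based update above, assuming we already track the previous step and previous derivative for every weight; the `eps` guard and the growth limit are illustrative safeguards, not part of the lecture's formula.

```python
import numpy as np

def quickprop_step(w, grad, prev_grad, prev_dw, eps=1e-12, max_growth=1.75):
    """One QuickProp update, applied independently to every weight.

    The second derivative is approximated by the secant
    (grad - prev_grad) / prev_dw, so the per-weight Newton step becomes
    dw = -prev_dw * grad / (grad - prev_grad).
    """
    denom = grad - prev_grad
    # Guard against division by ~0 when successive derivatives are nearly equal
    denom = np.where(np.abs(denom) > eps, denom, eps)
    dw = -prev_dw * grad / denom
    # Limit how much a step can grow relative to the previous one
    cap = max_growth * np.abs(prev_dw) + eps
    dw = np.clip(dw, -cap, cap)
    return w + dw, grad, dw        # grad, dw become prev_grad, prev_dw next time
```

In practice the very first iteration is typically a plain gradient step, since there is no previous step or derivative to build the secant from yet.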

Momentum methods

  • Insight
    • In directions that converge, the gradient keeps pointing the same way
      • Need to keep track of oscillations
      • Emphasize steps in directions that converge smoothly
    • In directions that overshoot, the gradient steps back and forth
      • Shrink steps in directions that bounce around
  • Maintain a running average of all past steps
    • The average gets longer in directions where the gradient keeps the same sign
    • It becomes shorter in directions where the sign keeps flipping
  • Update with the running average, rather than the current gradient
    • Emphasizing directions of steady improvement is demonstrably superior to other methods
  • Momentum Update

$$\Delta W^{(k)}=\beta\, \Delta W^{(k-1)}-\eta\, \nabla_{W} \operatorname{Loss}\left(W^{(k-1)}\right)^{T}$$

$$W^{(k)}=W^{(k-1)}+\Delta W^{(k)}$$

  • First compute the gradient step at the current location: $-\eta\, \nabla_{W} \operatorname{Loss}\left(W^{(k-1)}\right)^{T}$
  • Then add in the scaled previous step: $\beta\, \Delta W^{(k-1)}$ (a short sketch of this update follows)
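In code, one momentum update is just the two equations above; a minimal sketch, assuming `w`, `grad` and `dw` are NumPy arrays (or plain floats), with illustrative values for β and η:

```python
def momentum_step(w, grad, dw, lr=0.01, beta=0.9):
    """One momentum update; grad is the gradient of the loss at w."""
    dw = beta * dw - lr * grad   # scaled previous step + current gradient step
    w = w + dw                   # W(k) = W(k-1) + dW(k)
    return w, dw
```

With `beta = 0` this reduces to plain gradient descent.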


Nesterov’s Accelerated Gradient

  • Change the order of operations

$$\Delta W^{(k)}=\beta\, \Delta W^{(k-1)}-\eta\, \nabla_{W} \operatorname{Loss}\left(W^{(k-1)}+\beta\, \Delta W^{(k-1)}\right)^{T}$$

$$W^{(k)}=W^{(k-1)}+\Delta W^{(k)}$$

  • First extend the previous step: $\beta\, \Delta W^{(k-1)}$
  • Then compute the gradient step at the resulting look-ahead position: $-\eta\, \nabla_{W} \operatorname{Loss}\left(W^{(k-1)}+\beta\, \Delta W^{(k-1)}\right)^{T}$
  • Add the two to obtain the final step (a sketch follows)
  • Converges much faster than plain momentum
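The only change from the momentum sketch is where the gradient is evaluated. Here `grad_fn` is an assumed helper that returns $\nabla_{W}\operatorname{Loss}$ at a given $W$; β and η values are again illustrative:

```python
def nesterov_step(w, grad_fn, dw, lr=0.01, beta=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point."""
    lookahead = w + beta * dw                  # first extend the previous step
    dw = beta * dw - lr * grad_fn(lookahead)   # gradient step at the look-ahead position
    w = w + dw                                 # add the two to obtain the final step
    return w, dw
```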


Summary

  • Using the same step size for all dimensions is bad
    • Treat each dimension independently
  • Try to normalize curvature in all directions
    • Second order methods, e.g. Newton’s method
    • Too expensive: requires inversion of a giant Hessian
  • Treat each dimension independently
    • RProp / QuickProp
    • Works, but ignores dependence between dimensions
    • Can still be too slow
  • Momentum methods, which emphasize directions of steady improvement, are demonstrably superior to other methods

Incremental updates

  • Batch gradient descent
    • Try to simultaneously adjust the function at all training points
    • We must process all training points before making a single adjustment
  • Stochastic gradient descent
    • Adjust the function at one training point at a time
    • A single pass through the entire training data is called an “epoch”
      • An epoch over a training set with $T$ samples results in $T$ parameter updates
    • We must visit the samples in random order to get more convergent behavior
      • Otherwise we may get cyclic behavior (hard to converge)
    • Learning rate
      • With a fixed learning rate, correcting the function for individual instances leads to never-ending, non-convergent updates (correct one point, miss another)
      • The learning will continuously “chase” the latest sample
      • The fix is to let the learning rate decay over iterations: corrections for individual instances with the eventually minuscule learning rates will no longer significantly modify the function (a one-epoch sketch follows)
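A one-epoch sketch of SGD under these conventions. `sample_grad` is an assumed helper returning the gradient of the per-sample divergence, and `lr` would itself be decayed from epoch to epoch:

```python
import numpy as np

def sgd_epoch(w, X, Y, sample_grad, lr):
    """One epoch of SGD: T samples -> T parameter updates, in random order."""
    for i in np.random.permutation(len(X)):   # fresh shuffle avoids cyclic behavior
        w = w - lr * sample_grad(w, X[i], Y[i])
    return w
```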


  • Drawbacks

    • The loss used by both batch and stochastic gradient descent is an unbiased estimate of the expected divergence

    $$E[\operatorname{Loss}(f(X ; W), g(X))]=E[\operatorname{div}(f(X ; W), g(X))]$$

    • But the variance of the batch estimate is $\frac{1}{N}$ times that of the single-sample estimate used by stochastic gradient descent
    • Just as the sample mean $\frac{1}{N}\sum_{i} X_{i}$ and a single $X_{i}$ are both unbiased estimates of $\bar{X}$, but the mean has $\frac{1}{N}$ the variance
  • Mini-batch gradient descent

    • Adjust the function at a small, randomly chosen subset of points
    • Also an unbiased estimate of the expected error, with smaller variance than SGD
    • The mini-batch size is a hyperparameter to be optimized
    • Convergence depends on the learning rate
      • Simple technique: fix the learning rate until the error plateaus, then reduce it by a fixed factor (e.g. 10); a sketch of this schedule follows
      • Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
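A sketch tying these pieces together: mini-batch updates plus the simple “reduce on plateau” schedule. `batch_grad` and `full_loss` are assumed helpers (batch-averaged gradient and full-data loss), and the plateau test is deliberately crude.

```python
import numpy as np

def train_minibatch(w, X, Y, batch_grad, full_loss, lr=0.1,
                    batch_size=32, epochs=100, decay=0.1, patience=5):
    """Mini-batch gradient descent with a step-decay learning-rate schedule."""
    best, stale = np.inf, 0
    for _ in range(epochs):
        order = np.random.permutation(len(X))           # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * batch_grad(w, X[idx], Y[idx])  # unbiased gradient estimate
        err = full_loss(w, X, Y)
        if err < best - 1e-4:                           # still improving
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:                       # error has plateaued
                lr *= decay                             # cut the learning rate (e.g. by 10x)
                best, stale = err, 0
    return w
```

Here `batch_size`, `decay` and `patience` are exactly the kind of hyperparameters the section says must be tuned.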

