CMU 11-785 L07 Optimizers and regularizers

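This post takes a close look at the role of optimizers in neural network training, including how momentum, RMSProp, Adam and related algorithms work and their strengths and weaknesses. It also introduces techniques that improve network convergence and performance, such as batch normalization, gradient clipping, and data augmentation.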


Optimizers

  • Momentum and Nesterov’s method improve convergence by normalizing the mean (first moment) of the derivatives
  • Considering the second moments
    • RMS Prop / Adagrad / AdaDelta / ADAM [1]
  • Simple gradient and momentum methods still demonstrate oscillatory behavior in some directions [2]
    • Depends on magic step size parameters (learning rate)
  • Need to dampen step size in directions with high motion
    • Second-order term (use the variation of the derivatives to smooth the updates)
    • Scale down updates with large mean squared derivatives
    • Scale up updates with small mean squared derivatives

RMS Prop

  • Notion
    • The squared derivative is $\partial_{w}^{2} D=\left(\partial_{w} D\right)^{2}$
    • The mean squared derivative is $E\left[\partial_{w}^{2} D\right]$
  • This is a variant on the basic mini-batch SGD algorithm
    • Updates are by parameter

$$E\left[\partial_{w}^{2} D\right]_{k}=\gamma E\left[\partial_{w}^{2} D\right]_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}$$

$$w_{k+1}=w_{k}-\frac{\eta}{\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon}} \partial_{w} D$$

  • If the derivative stays roughly the same over a long period, $\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon} \approx |\partial_{w} D|$
    • So $w_{k+1}=w_{k}-\text{sign}(\partial_{w} D)\,\eta$
    • Only the sign remains, similar to RProp
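The RMS Prop update above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rmsprop_step`, the default values of `lr`, `gamma`, and `eps`, and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMS Prop update; returns the new weights and running mean square."""
    # Running estimate of the mean squared derivative, kept per parameter
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad ** 2
    # Divide each parameter's step by the root of its mean squared derivative
    w = w - lr * grad / np.sqrt(mean_sq + eps)
    return w, mean_sq

# Toy objective (assumed for illustration): D(w) = 0.5 * (100 * w0^2 + w1^2),
# whose two directions have very different curvature.
def grad_fn(w):
    return np.array([100.0, 1.0]) * w

w = np.array([1.0, 1.0])
mean_sq = np.zeros_like(w)
for _ in range(500):
    w, mean_sq = rmsprop_step(w, grad_fn(w), mean_sq)
```

Because the step is divided by the root mean squared derivative, both coordinates move at a similar pace even though their gradients differ by a factor of 100.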

Adam

  • RMS prop only considers a second-moment normalized version of the current gradient
  • ADAM utilizes a smoothed version of the momentum-augmented gradient
    • Considers both first and second moments

$$m_{k}=\delta m_{k-1}+(1-\delta)\left(\partial_{w} D\right)_{k}$$

$$v_{k}=\gamma v_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}$$

$$\hat{m}_{k}=\frac{m_{k}}{1-\delta^{k}}, \quad \quad \hat{v}_{k}=\frac{v_{k}}{1-\gamma^{k}}$$

$$w_{k+1}=w_{k}-\frac{\eta}{\sqrt{\hat{v}_{k}+\epsilon}} \hat{m}_{k}$$

  • Typically $\delta \approx 1$ and $m_{0}, v_{0}$ are initialized to $0$, so $1-\delta \approx 0$ and the running estimates are very slow to move away from zero in the beginning
    • So we need the correction $\hat{m}_{k}=\frac{m_{k}}{1-\delta^{k}}$ (and likewise $\hat{v}_{k}$) to scale the estimates up in the beginning
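A minimal NumPy sketch of the Adam equations above, including the bias correction. The function name `adam_step` and the default values for `lr`, `delta`, `gamma`, and `eps` are assumptions chosen for illustration (they follow commonly used settings, not anything stated in these notes).

```python
import numpy as np

def adam_step(w, grad, m, v, k, lr=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    """One Adam update at step k (k starts at 1)."""
    # Smoothed first moment (momentum-like average of the gradient)
    m = delta * m + (1.0 - delta) * grad
    # Smoothed second moment (average of the squared gradient)
    v = gamma * v + (1.0 - gamma) * grad ** 2
    # Bias correction: scales the estimates up while k is small
    m_hat = m / (1.0 - delta ** k)
    v_hat = v / (1.0 - gamma ** k)
    # Same form as the update rule above, with epsilon inside the square root
    w = w - lr * m_hat / np.sqrt(v_hat + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([2.0, -0.5])
w, m, v = adam_step(w, grad, m, v, k=1)
```

At the first step the corrected moments equal the gradient and its square, so each weight moves by roughly `lr` in the direction opposite to its gradient's sign, regardless of the gradient's magnitude.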

Tricks

  • To make the network converge better, we can consider the following aspects
    • The Divergence
    • Dropout
    • Batch normalization
    • Gradient clipping
    • Data augmentation

Divergence

  • What shape do we want the divergence function to have?
    • Must be smooth and not have many poor local optima
    • The best type of divergence is steep far from the optimum, but shallow at the optimum
      • But not too shallow (hard to converge to the minimum)
  • The choice of divergence affects both the learned network and results


  • Common choices

    • L2 divergence

    $$Div=\frac{1}{2} \sum_{i}\left(y_{i}-d_{i}\right)^{2}$$

    • KL divergence

    $$Div=\sum_{i} d_{i} \log \left(d_{i}\right)-\sum_{i} d_{i} \log \left(y_{i}\right)$$

  • L2 is particularly appropriate when attempting to perform regression

    • Numeric prediction
    • For L2 divergence the derivative w.r.t. the pre-activation of the output layer is:
      • $\nabla_{z} \frac{1}{2}\|y-d\|^{2}=(y-d) J_{y}(z)$
    • We literally “propagate” the error $(y-d)$ backward
    • Which is why the method is sometimes called “error backpropagation”
  • The KL divergence is better when the intent is classification

    • The output is a probability vector
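The two divergences above can be computed as in the following sketch. The function names and the small constant `tiny` (added to avoid `log(0)`) are assumptions for the example; terms with $d_i = 0$ are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import numpy as np

def l2_divergence(y, d):
    """0.5 * sum_i (y_i - d_i)^2, suited to regression targets."""
    return 0.5 * np.sum((y - d) ** 2)

def kl_divergence(y, d, tiny=1e-12):
    """sum_i d_i log d_i - sum_i d_i log y_i, for probability vectors y and d."""
    mask = d > 0  # terms with d_i = 0 contribute nothing
    return np.sum(d[mask] * (np.log(d[mask]) - np.log(y[mask] + tiny)))

y = np.array([0.7, 0.2, 0.1])   # network output (a probability vector)
d = np.array([1.0, 0.0, 0.0])   # one-hot desired output
l2 = l2_divergence(y, d)
kl = kl_divergence(y, d)
```

For a one-hot target the KL divergence reduces to the negative log of the probability assigned to the correct class.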

Batch normalization

  • Covariate shift problem

    • Training assumes the training data are all similarly distributed (and so are the mini-batches)
    • In practice, each minibatch may have a different distribution
    • Which may occur in each layer of the network
    • Minimizing the divergence on one batch does not necessarily reduce it on the other batches
  • Solution

    • Move all batches to have a mean of 0 and unit standard deviation
    • Eliminates covariate shift between batches
  • Batch normalization is a covariate adjustment unit that happens after the weighted addition of inputs (affine combination) but before the application of activation [3]

  • Steps

    • Shift the covariates to a standard position (zero mean, unit variance)

    $$u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$$

    • Shift to the right position (learned scale $\gamma$ and offset $\beta$)

    $$\hat{z}_{i} = \gamma u_i + \beta$$
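The two steps above can be sketched for a whole minibatch as follows. This assumes $\mu_B$ and $\sigma_B^2$ are the per-neuron minibatch mean and (biased, $1/B$) variance; the function name `batchnorm_forward`, the `(B, N)` array layout, and the value of `eps` are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """z has shape (B, N): B instances in the minibatch, N neurons."""
    mu = z.mean(axis=0)                  # per-neuron minibatch mean
    var = z.var(axis=0)                  # per-neuron minibatch variance (1/B)
    u = (z - mu) / np.sqrt(var + eps)    # step 1: standard position
    z_hat = gamma * u + beta             # step 2: learned scale and offset
    return z_hat, u, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
z_hat, u, mu, var = batchnorm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance per neuron, whatever the mean and scale of the incoming batch.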

Backpropagation
  • The outputs are now functions of $\mu_B$ and $\sigma_B^2$, which are functions of the entire minibatch

$$\operatorname{Div}(MB)=\frac{1}{B} \sum_{t} \operatorname{Div}\left(Y_{t}\left(X_{t}, \mu_{B}, \sigma_{B}^{2}\right), d_{t}\left(X_{t}\right)\right)$$

  • The divergence for each $Y_t$ depends on all the $X_t$ within the mini-batch
    • Is a vector function over the mini-batch
  • Use an influence diagram to calculate the derivatives [4]


  • Goal

    • We need to calculate the derivatives w.r.t. the learnable parameters, $\frac{d Div}{d \gamma}, \frac{d Div}{d \beta}$, and w.r.t. the affine combination, $\frac{\partial Div}{\partial z_i}$

    $$\frac{\partial Div}{\partial z_{i}}=\frac{\partial Div}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}+\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}$$

    • So we additionally need $\frac{\partial Div}{\partial u_{i}},\frac{\partial Div}{\partial \sigma_{B}^{2}},\frac{\partial Div}{\partial \mu_{B}}$
  • Preparation

$$\mu_{B}=\frac{1}{B} \sum_{i=1}^{B} z_{i}\quad \quad \sigma_{B}^{2}=\frac{1}{B} \sum_{i=1}^{B}\left(z_{i}-\mu_{B}\right)^{2}$$

$$u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}} \quad \quad \hat{z}_{i} = \gamma u_i + \beta$$

  • For the first term $\frac{\partial Div}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}$

    • First calculate $\frac{d Div}{d \gamma}, \frac{d Div}{d \beta}$ (per instance; the contributions are summed over the minibatch) and $\frac{\partial Div}{\partial u_i}$, all of which follow from $\hat{z}_i = \gamma u_i + \beta$
      $$\frac{d Div}{d \beta}=\frac{d Div}{d \hat{z}} \quad \quad \frac{d Div}{d \gamma}=u \frac{d Div}{d \hat{z}} \quad \quad \frac{\partial Div}{\partial u_{i}}=\gamma \frac{\partial Div}{\partial \hat{z}_{i}}$$

    • $\frac{\partial u_{i}}{\partial z_{i}} = \frac{1}{\sqrt{\sigma^2_B +\epsilon}}$, so the first term = $\frac{\partial Div}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}$

  • For the second term $\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}$

    • Calculate $\frac{\partial Div}{\partial \sigma_{B}^{2}}$

    $$\frac{\partial Div}{\partial \sigma_{B}^{2}}=\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}} \frac{\partial u_{i}}{\partial \sigma_{B}^{2}}$$

    $$\frac{\partial Div}{\partial \sigma_{B}^{2}}=\frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2} \sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)$$

    • And $\frac{\partial \sigma_{B}^{2}}{\partial z_{i}}$

    $$\frac{\partial \sigma_{B}^{2}}{\partial z_{i}}=\frac{2\left(z_{i}-\mu_{B}\right)}{B}$$

    • So the second term = $\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}$
  • Finally, for the third term $\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}$

    • Calculate $\frac{\partial Div}{\partial \mu_{B}}$

    $$\frac{\partial Div}{\partial \mu_B}=\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}} \frac{\partial u_{i}}{\partial \mu_{B}}+\frac{\partial Div}{\partial\sigma_{B}^{2} } \frac{\partial \sigma_{B}^{2}}{\partial \mu_{B}}$$

    $$\frac{\partial Div}{\partial \mu_{B}}=\left(\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}} \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right)+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\sum_{i=1}^{B}-2\left(z_{i}-\mu_{B}\right)}{B}$$

    • The last term is zero, because $\sum_{i}\left(z_{i}-\mu_{B}\right)=0$. And because $\mu_B = \frac{1}{B} \sum_i z_i$,

    $$\frac{\partial \mu_{B}}{\partial z_{i}}=\frac{1}{B}$$

    • So the third term = $\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{1}{B}$
  • Overall

$$\frac{\partial Div}{\partial z_{i}}=\frac{\partial Div}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}+\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{1}{B}$$

$$\frac{\partial Div}{\partial \sigma_{B}^{2}}=\frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2} \sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)$$

$$\frac{\partial Div}{\partial \mu_{B}}=\frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}$$
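The three formulas above translate almost line by line into NumPy. The sketch below is illustrative only: the names `bn_forward` and `bn_backward`, the `(B, N)` layout, and `eps` are assumptions, and it uses $\frac{\partial Div}{\partial u_i} = \gamma \frac{\partial Div}{\partial \hat{z}_i}$, which follows from $\hat{z}_i = \gamma u_i + \beta$.

```python
import numpy as np

def bn_forward(z, gamma, beta, eps=1e-5):
    """Batch-norm forward pass for z of shape (B, N)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    u = (z - mu) / np.sqrt(var + eps)
    return gamma * u + beta

def bn_backward(dz_hat, z, gamma, eps=1e-5):
    """dz_hat is dDiv/dz_hat, shape (B, N). Returns dDiv/dz, dDiv/dgamma, dDiv/dbeta."""
    B = z.shape[0]
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    u = (z - mu) * inv_std

    # Learnable parameters: per-instance contributions summed over the batch
    dbeta = dz_hat.sum(axis=0)
    dgamma = (u * dz_hat).sum(axis=0)

    # dDiv/du_i = gamma * dDiv/dz_hat_i
    du = gamma * dz_hat

    # dDiv/dsigma^2 and dDiv/dmu, as in the "Overall" formulas
    dvar = -0.5 * (var + eps) ** -1.5 * (du * (z - mu)).sum(axis=0)
    dmu = -inv_std * du.sum(axis=0)

    # Sum of the three terms of dDiv/dz_i
    dz = du * inv_std + dvar * 2.0 * (z - mu) / B + dmu / B
    return dz, dgamma, dbeta

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
c = rng.normal(size=(5, 3))        # toy divergence: Div = sum(c * z_hat)
dz, dgamma, dbeta = bn_backward(c, z, gamma)
```

A quick way to gain confidence in a derivation like this is to compare `dz` against central finite differences of the toy divergence `np.sum(c * bn_forward(z, gamma, beta))`.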

Inference
  • On test data, BN requires $\mu_B$ and $\sigma_B^2$
  • We will use the average over all training minibatches

$$\mu_{BN}=\frac{1}{\text{Nbatches}} \sum_{\text{batch}} \mu_{B}(\text{batch})$$

$$\sigma_{BN}^{2}=\frac{B}{(B-1)\,\text{Nbatches}} \sum_{\text{batch}} \sigma_{B}^{2}(\text{batch})$$

  • Note: these are neuron-specific
    • $\mu_B(\text{batch})$ and $\sigma_B^2(\text{batch})$ are obtained from the final converged network
    • The $B/(B-1)$ term gives us an unbiased estimator for the variance
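As a sketch of the inference-time statistics above: given the per-batch means and variances collected from the converged network, the two averages can be computed as below. The function names, the `(Nbatches, N)` array layout, and `eps` are assumptions made for the example.

```python
import numpy as np

def bn_inference_stats(batch_means, batch_vars, B):
    """batch_means, batch_vars: shape (Nbatches, N); B is the minibatch size."""
    mu_bn = batch_means.mean(axis=0)
    # B / (B - 1) turns the averaged biased variances into an unbiased estimate
    var_bn = B / (B - 1) * batch_vars.mean(axis=0)
    return mu_bn, var_bn

def bn_inference(z, gamma, beta, mu_bn, var_bn, eps=1e-5):
    """Apply batch norm to test data using the fixed training statistics."""
    return gamma * (z - mu_bn) / np.sqrt(var_bn + eps) + beta

batch_means = np.array([[1.0, 2.0], [3.0, 4.0]])
batch_vars = np.array([[1.0, 1.0], [3.0, 3.0]])
mu_bn, var_bn = bn_inference_stats(batch_means, batch_vars, B=4)
out = bn_inference(np.array([[2.0, 3.0]]), np.ones(2), np.zeros(2), mu_bn, var_bn)
```

Because the statistics are fixed, the output for a test instance no longer depends on which other instances happen to be evaluated with it.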
What can it do
  • Improves both convergence rate and neural network performance
    • Anecdotal evidence that BN eliminates the need for dropout
  • To get maximum benefit from BN, learning rates must be increased and learning rate decay can be faster
    • Since the data generally remain in the high-gradient regions of the activations
    • E.g. for a sigmoid activation, normalization moves the data toward the near-linear region around zero, where the gradient is high
  • Also needs better randomization of training data order

Smoothness

  • Smoothness through network structure
    • MLPs are universal approximators
    • For a given number of parameters, deeper networks impose more smoothness than shallow, wide ones
    • Each layer restricts the shape of the function
  • Smoothness through weight constraints

Regularizer

  • The “desired” output is generally smooth
    • Capture statistical or average trends
  • Overfitting
    • But an unconstrained model will model individual instances instead
    • Why overfitting? [5]
  • Using a sigmoid activation, as $|w|$ increases, the response becomes steeper


  • Constraining the weights to be low will force slower perceptrons and smoother output response
  • Regularized training: minimize the loss while also minimizing the weights

$$L\left(W_{1}, W_{2}, \ldots, W_{K}\right)=\operatorname{Loss}\left(W_{1}, W_{2}, \ldots, W_{K}\right)+\frac{1}{2} \lambda \sum_{k}\left\|W_{k}\right\|_{2}^{2}$$

  • $\lambda$ is the regularization parameter; its value reflects how important it is to keep the weights small
  • Increasing $\lambda$ assigns greater importance to shrinking the weights
    • Accept a larger error on the training data to obtain a smoother, more acceptable network
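A small sketch of the regularized objective above and its gradient. It treats $\|W_k\|_2^2$ as the sum of squared entries of $W_k$ (an assumption), and the function names are illustrative. Since the derivative of $\frac{1}{2}\lambda\|W\|_2^2$ w.r.t. $W$ is $\lambda W$, the regularizer simply adds $\lambda W_k$ to each layer's loss gradient.

```python
import numpy as np

def regularized_loss(loss, weights, lam):
    """Loss + 0.5 * lam * sum_k ||W_k||^2 over all weight arrays."""
    return loss + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def regularized_grads(loss_grads, weights, lam):
    """Add the regularizer's gradient, lam * W_k, to each layer's loss gradient."""
    return [g + lam * W for g, W in zip(loss_grads, weights)]

weights = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([1.0, -1.0])]
loss_grads = [np.zeros((2, 2)), np.zeros(2)]   # placeholder loss gradients
total = regularized_loss(2.0, weights, lam=0.1)
grads = regularized_grads(loss_grads, weights, lam=0.1)
```

With plain gradient descent, the extra `lam * W` term shrinks every weight toward zero by a constant fraction at each step, which is why this regularizer is often called weight decay.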

Dropout

  • “Dropout” is a stochastic data/model erasure method that sometimes forces the network to learn more robust models

  • Bagging method

    • Using ensemble classifiers to improve prediction
  • Dropout

    • For each input, at each iteration, “turn off” each neuron with a probability $1-\alpha$
    • Also turn off inputs similarly
  • Backpropagation is effectively performed only over the remaining network

    • The effective network is different for different inputs
    • Effectively learns a network that averages over all possible networks (Bagging)
  • Dropout as a mechanism to increase pattern density

    • Dropout forces the neurons to learn “rich” and redundant patterns
    • E.g. without dropout, a noncompressive layer may just “clone” its input to its output
    • Transferring the task of learning to the rest of the network upstream
Implementation
  • The expected output of the neuron is
    $$y_{i}^{(k)}=\alpha \sigma\left(\sum_{j} w_{j i}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)}\right)$$

  • During test, push the $\alpha$ to all outgoing weights

$$\begin{aligned} z_{i}^{(k)} &=\sum_{j} w_{j i}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)} \\ &=\sum_{j} w_{j i}^{(k)} \alpha \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \\ &=\sum_{j}\left(\alpha w_{j i}^{(k)}\right) \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \end{aligned}$$

  • So $W_{test} = \alpha W_{trained}$
    • Instead of multiplying every output by $\alpha$, multiply all weights by $\alpha$
  • Alternate implementation
    • During training, replace the activation of all neurons in the network by $\alpha^{-1} \sigma(\cdot)$
    • Use $\sigma(\cdot)$ as the activation during testing, and do not modify the weights
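The alternate implementation can be sketched as follows: during training each activation is kept with probability $\alpha$ and the survivors are scaled by $\alpha^{-1}$, so the expected activation matches the unscaled one used at test time. The function names and the use of a NumPy random generator are assumptions for illustration.

```python
import numpy as np

def dropout_train(y, alpha, rng):
    """Keep each activation with probability alpha; scale survivors by 1/alpha."""
    mask = rng.random(y.shape) < alpha   # True where the neuron stays on
    return y * mask / alpha, mask

def dropout_test(y):
    """At test time the activations are used unchanged."""
    return y

rng = np.random.default_rng(0)
y = np.ones(100_000)                      # stand-in activations sigma(z)
out, mask = dropout_train(y, alpha=0.8, rng=rng)
```

Averaged over many neurons, `out.mean()` stays close to `y.mean()`, which is the property that lets the test-time network skip any rescaling of weights.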


More tricks

  • Obtain training data
    • Use appropriate representation for inputs and outputs
    • Data Augmentation
  • Choose network architecture
    • More neurons need more data
    • Deep is better, but harder to train
  • Choose the appropriate divergence function
    • Choose regularization
  • Choose heuristics
    • batch norm, dropout
  • Choose optimization algorithm
    • Adagrad / Adam / SGD
  • Perform a grid search for hyperparameters (learning rate, regularization parameter, …) on held-out data
  • Train
    • Evaluate periodically on validation data, for early stopping if required

  1. A good summary of recent optimizers can be seen here.

  2. Animations for optimization algorithms

  3. Batch normalization in Neural Networks

  4. A simple and clear demonstration of 2 variables in a single network

  5. The perceptrons in the network are individually capable of sharp changes in output
