Optimizers
- Momentum and Nesterov's method improve convergence by using a smoothed estimate of the mean (first moment) of the derivatives
- Considering the second moment of the derivatives as well leads to:
- RMS Prop / Adagrad / AdaDelta / ADAM1
- Simple gradient and momentum methods still demonstrate oscillatory behavior in some directions2
- Still depend on a "magic" step size parameter (learning rate)
- Need to dampen step size in directions with high motion
- Use a second-moment term, smoothed over iterations, to adjust the step size
- Scale down updates with large mean squared derivatives
- Scale up updates with small mean squared derivatives
RMS Prop
- Notation
- The squared derivative is $\partial_{w}^{2} D=\left(\partial_{w} D\right)^{2}$
- The mean squared derivative is $E\left[\partial_{w}^{2} D\right]$
- This is a variant on the basic mini-batch SGD algorithm
- Updates are by parameter
$$E\left[\partial_{w}^{2} D\right]_{k}=\gamma E\left[\partial_{w}^{2} D\right]_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}$$
$$w_{k+1}=w_{k}-\frac{\eta}{\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon}}\, \partial_{w} D$$
- If the derivatives stay similar over a long period, $\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon} \approx |\partial_{w} D|$
- So $w_{k+1}=w_{k}-\operatorname{sign}(\partial_{w} D)\,\eta$
- Only the sign remains, similar to RProp (a minimal code sketch of the RMS Prop update follows)
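A minimal NumPy sketch of the per-parameter RMS Prop update above; the function name, defaults, and the way the running mean `ms` is threaded through are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def rmsprop_update(w, grad, ms, lr=1e-3, gamma=0.9, eps=1e-8):
    """One RMS Prop step for a parameter array `w` given its gradient `grad`.
    `ms` holds the running mean of squared derivatives E[(dD/dw)^2]."""
    ms = gamma * ms + (1 - gamma) * grad ** 2      # smoothed second moment
    w = w - lr * grad / np.sqrt(ms + eps)          # scale the step by the RMS derivative
    return w, ms
```

In directions with consistently large derivatives the denominator grows and the step shrinks, which is exactly the damping described above.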
Adam
- RMS prop only considers a second-moment normalized version of the current gradient
- ADAM utilizes a smoothed version of the momentum-augmented gradient
- Considers both first and second moments
$$m_{k}=\delta m_{k-1}+(1-\delta)\left(\partial_{w} D\right)_{k}$$
$$v_{k}=\gamma v_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}$$
$$\hat{m}_{k}=\frac{m_{k}}{1-\delta^{k}}, \qquad \hat{v}_{k}=\frac{v_{k}}{1-\gamma^{k}}$$
$$w_{k+1}=w_{k}-\frac{\eta}{\sqrt{\hat{v}_{k}+\epsilon}} \hat{m}_{k}$$
- Typically $\delta \approx 1$ and we initialize $m_{k-1}, v_{k-1} \approx 0$, so $1-\delta \approx 0$ and the moment estimates update very slowly in the beginning
- So we need the bias-correction term $\hat{m}_{k}=\frac{m_{k}}{1-\delta^{k}}$ to scale the estimates up in the beginning (see the code sketch below)
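A matching sketch of one Adam step in the notation above ($\delta$ and $\gamma$ correspond to the $\beta_1$ and $\beta_2$ of most libraries); again the names and default values are assumptions for illustration:

```python
import numpy as np

def adam_update(w, grad, m, v, k, lr=1e-3, delta=0.9, gamma=0.999, eps=1e-8):
    """One Adam step at iteration k (k starts at 1) for a parameter array `w`."""
    m = delta * m + (1 - delta) * grad             # first-moment (momentum) estimate
    v = gamma * v + (1 - gamma) * grad ** 2        # second-moment estimate
    m_hat = m / (1 - delta ** k)                   # bias correction: scales up early estimates
    v_hat = v / (1 - gamma ** k)
    w = w - lr * m_hat / np.sqrt(v_hat + eps)
    return w, m, v
```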
Tricks
- To make the network converge better, we can consider the following aspects
- The Divergence
- Dropout
- Batch normalization
- Gradient clipping
- Data augmentation
Divergence
- What shape do we want the divergence function to have?
- Must be smooth and not have many poor local optima
- The best type of divergence is steep far from the optimum, but shallow at the optimum
- But not too shallow (otherwise it is hard to converge to the minimum)
- The choice of divergence affects both the learned network and results
- Common choices
- L2 divergence
$$Div=\frac{1}{2} \sum_{i}\left(y_{i}-d_{i}\right)^{2}$$
- KL divergence
$$Div=\sum_{i} d_{i} \log \left(d_{i}\right)-\sum_{i} d_{i} \log \left(y_{i}\right)$$
- L2 is particularly appropriate when attempting to perform regression
- Numeric prediction
- For L2 divergence, the derivative w.r.t. the pre-activation of the output layer is:
- $\nabla_{z} \frac{1}{2}\|y-d\|^{2}=(y-d)\, J_{y}(z)$
- We literally “propagate” the error $(y-d)$ backward
- Which is why the method is sometimes called “error backpropagation”
- The KL divergence is better when the intent is classification (both divergences are sketched in code below)
- The output is a probability vector
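A small sketch of both divergences for a single training instance; `eps` is an assumed numerical-safety constant that is not part of the formulas above:

```python
import numpy as np

def l2_divergence(y, d):
    """L2 divergence, suited to regression: 0.5 * sum_i (y_i - d_i)^2."""
    return 0.5 * np.sum((y - d) ** 2)

def kl_divergence(y, d, eps=1e-12):
    """KL divergence, suited to classification; y and d are probability vectors."""
    return np.sum(d * np.log(d + eps)) - np.sum(d * np.log(y + eps))
```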
Batch normalization
- Covariate shift problem
- Training assumes the training data are all similarly distributed (and so are the mini-batches)
- In practice, each minibatch may have a different distribution
- Which may occur in each layer of the network
- Minimizing the loss on one batch cannot give the correct direction for the other batches
- Solution
- Move all batches to have a mean of 0 and unit standard deviation
- Eliminates covariate shift between batches
- Batch normalization is a covariate adjustment unit that happens after the weighted addition of inputs (affine combination) but before the application of the activation3
- Steps (a code sketch of this forward pass follows the steps)
- Covariate shift to standard position
$$u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$$
- Shift to the right position (learnable scale $\gamma$ and offset $\beta$)
$$\hat{z}_{i} = \gamma u_{i} + \beta$$
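A sketch of this forward pass for one layer's pre-activations `z` of shape (batch, neurons); `gamma` and `beta` are the learnable per-neuron scale and offset, and the cache layout is an implementation assumption reused in the backward-pass sketch later:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                   # per-neuron batch mean
    var = z.var(axis=0)                   # per-neuron batch variance
    u = (z - mu) / np.sqrt(var + eps)     # covariate shift to standard position
    z_hat = gamma * u + beta              # shift to the learned position
    cache = (u, var, gamma, eps)          # saved for backpropagation
    return z_hat, cache
```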
Backpropagation
- The outputs are now functions of $\mu_B$ and $\sigma_B^2$, which are functions of the entire minibatch
$$\operatorname{Div}(MB)=\frac{1}{B} \sum_{t} \operatorname{Div}\left(Y_{t}\left(X_{t}, \mu_{B}, \sigma_{B}^{2}\right), d_{t}\left(X_{t}\right)\right)$$
- The divergence for each $Y_t$ depends on all the $X_t$ within the mini-batch
- Is a vector function over the mini-batch
- Use an influence diagram to calculate the derivatives4
- Goal
- We need to calculate the derivatives w.r.t. the learnable parameters, $\frac{d Div}{d\gamma}, \frac{d Div}{d\beta}$, and w.r.t. the affine combination, $\frac{\partial Div}{\partial z_i}$
$$\frac{\partial Div}{\partial z_{i}}=\frac{\partial Div}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}+\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}$$
- So we additionally need $\frac{\partial Div}{\partial u_{i}}, \frac{\partial Div}{\partial \sigma_{B}^{2}}, \frac{\partial Div}{\partial \mu_{B}}$
- Preparation
$$\mu_{B}=\frac{1}{B} \sum_{i=1}^{B} z_{i} \qquad\quad \sigma_{B}^{2}=\frac{1}{B} \sum_{i=1}^{B}\left(z_{i}-\mu_{B}\right)^{2}$$
$$u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}} \qquad\quad \hat{z}_{i} = \gamma u_{i} + \beta$$
- For the first term, $\frac{\partial Div}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}$
- First calculate $\frac{d Div}{d\gamma}, \frac{d Div}{d\beta}$
$$\frac{d Div}{d \beta}=\frac{d Div}{d \hat{z}} \qquad\quad \frac{d Div}{d \gamma}=u \frac{d Div}{d \hat{z}}$$
- $\frac{\partial u_{i}}{\partial z_{i}} = \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}$, so the first term is $\frac{\partial Div}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}$
- For the second term, $\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}$
- Calculate $\frac{\partial Div}{\partial \sigma_{B}^{2}}$
$$\frac{\partial Div}{\partial \sigma_{B}^{2}}=\sum_{i} \frac{\partial Div}{\partial u_{i}} \frac{\partial u_{i}}{\partial \sigma_{B}^{2}}$$
$$\frac{\partial Div}{\partial \sigma_{B}^{2}}=-\frac{1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3/2} \sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)$$
- And $\frac{\partial \sigma_{B}^{2}}{\partial z_{i}}$
$$\frac{\partial \sigma_{B}^{2}}{\partial z_{i}}=\frac{2\left(z_{i}-\mu_{B}\right)}{B}$$
- So the second term is $\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}$
- Finally, for the third term, $\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}$
- Calculate $\frac{\partial Div}{\partial \mu_{B}}$
$$\frac{\partial Div}{\partial \mu_{B}}=\sum_{i} \frac{\partial Div}{\partial u_{i}} \frac{\partial u_{i}}{\partial \mu_{B}}+\frac{\partial Div}{\partial \sigma_{B}^{2}} \frac{\partial \sigma_{B}^{2}}{\partial \mu_{B}}$$
$$\frac{\partial Div}{\partial \mu_{B}}=\left(\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}} \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right)+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{\sum_{i=1}^{B}-2\left(z_{i}-\mu_{B}\right)}{B}$$
- The last term is zero (since $\sum_{i}\left(z_{i}-\mu_{B}\right)=0$), and because $\mu_{B}=\frac{1}{B} \sum_{i} z_{i}$,
$$\frac{\partial \mu_{B}}{\partial z_{i}}=\frac{1}{B}$$
- So the third term is $\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{1}{B}$
- Overall (implemented in the code sketch after these formulas)
$$\frac{\partial Div}{\partial z_{i}}=\frac{\partial Div}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}+\frac{\partial Div}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}+\frac{\partial Div}{\partial \mu_{B}} \cdot \frac{1}{B}$$
$$\frac{\partial Div}{\partial \sigma_{B}^{2}}=-\frac{1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3/2} \sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)$$
$$\frac{\partial Div}{\partial \mu_{B}}=\frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\sum_{i=1}^{B} \frac{\partial Div}{\partial u_{i}}$$
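A sketch of a backward pass implementing exactly these three formulas, paired with the `batchnorm_forward` sketch above; `dz_hat` is $\frac{\partial Div}{\partial \hat{z}}$ arriving from the layer above, and the variable names are assumptions:

```python
import numpy as np

def batchnorm_backward(dz_hat, cache):
    u, var, gamma, eps = cache
    B = dz_hat.shape[0]
    dbeta = dz_hat.sum(axis=0)                         # dDiv/dbeta
    dgamma = (u * dz_hat).sum(axis=0)                  # dDiv/dgamma
    du = dz_hat * gamma                                # dDiv/du_i
    z_minus_mu = u * np.sqrt(var + eps)                # recover (z_i - mu_B)
    dvar = -0.5 * (var + eps) ** -1.5 * np.sum(du * z_minus_mu, axis=0)   # dDiv/dsigma_B^2
    dmu = -np.sum(du, axis=0) / np.sqrt(var + eps)                        # dDiv/dmu_B
    dz = du / np.sqrt(var + eps) + dvar * 2 * z_minus_mu / B + dmu / B    # dDiv/dz_i
    return dz, dgamma, dbeta
```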
Inference
- On test data, BN requires $\mu_B$ and $\sigma_B^2$
- We will use the average over all training minibatches
$$\mu_{BN}=\frac{1}{\text{Nbatches}} \sum_{\text{batch}} \mu_{B}(\text{batch})$$
$$\sigma_{BN}^{2}=\frac{B}{(B-1)\,\text{Nbatches}} \sum_{\text{batch}} \sigma_{B}^{2}(\text{batch})$$
- Note: these are neuron-specific
- $\mu_B(\text{batch})$ and $\sigma_B^2(\text{batch})$ are obtained from the final converged network
- The $B/(B-1)$ term gives us an unbiased estimator for the variance (see the sketch below)
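A sketch of the inference-time computation, where the stored per-batch statistics from the converged network are averaged once and then reused for every test input; `mus` and `vars_` are assumed to be lists of per-batch mean/variance vectors:

```python
import numpy as np

def aggregate_bn_stats(mus, vars_, B):
    """Average the per-minibatch statistics; B/(B-1) makes the variance unbiased."""
    mu_bn = np.mean(mus, axis=0)
    var_bn = (B / (B - 1)) * np.mean(vars_, axis=0)
    return mu_bn, var_bn

def batchnorm_inference(z, gamma, beta, mu_bn, var_bn, eps=1e-5):
    """Normalize test-time pre-activations with the aggregated training statistics."""
    u = (z - mu_bn) / np.sqrt(var_bn + eps)
    return gamma * u + beta
```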
What can it do
- Improves both convergence rate and neural network performance
- Anecdotal evidence that BN eliminates the need for dropout
- To get maximum benefit from BN, learning rates must be increased and learning rate decay can be faster
- Since the data generally remain in the high-gradient regions of the activations
- e.g. for the sigmoid activation, BN keeps the data in the near-linear region, where the gradient is high
- Also needs better randomization of training data order
Smoothness
- Smoothness through network structure
- MLPs are universal approximators
- For a given number of parameters, deeper networks impose more smoothness than shallow and wide ones
- Each layer restricts the shape of the function
- Smoothness through weight constraints
Regularizer
- The "desired” output is generally smooth
- Capture statistical or average trends
- Overfitting
- But an unconstrained model will model individual instances instead
- Why overfitting?5
- Using a sigmoid activation, as $|w|$ increases, the response becomes steeper
- Constraining the weights to be low will force slower perceptrons and smoother output response
- Regularized training: minimize the loss while also minimizing the weights
$$L\left(W_{1}, W_{2}, \ldots, W_{K}\right)=\operatorname{Loss}\left(W_{1}, W_{2}, \ldots, W_{K}\right)+\frac{1}{2} \lambda \sum_{k}\left\|W_{k}\right\|_{2}^{2}$$
- $\lambda$ is the regularization parameter; its value reflects how important it is to keep the weights small
- Increasing $\lambda$ assigns greater importance to shrinking the weights
- We accept greater error on the training data in order to obtain a more acceptable (smoother) network (a minimal sketch of the regularized loss follows)
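A minimal sketch of the regularized loss and the corresponding gradient contribution (the extra $\lambda W_k$ term is the familiar weight-decay term); the function names and the list-of-matrices layout are assumptions:

```python
import numpy as np

def regularized_loss(data_loss, weights, lam):
    """Loss(W_1..W_K) + 0.5 * lambda * sum_k ||W_k||_2^2."""
    return data_loss + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def regularized_grads(data_grads, weights, lam):
    """The gradient of the penalty term is lambda * W_k, added to each data-loss gradient."""
    return [g + lam * W for g, W in zip(data_grads, weights)]
```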
Dropout
- “Dropout” is a stochastic data/model erasure method that sometimes forces the network to learn more robust models
- Bagging method
- Using ensemble classifiers to improve prediction
- Dropout
- For each input, at each iteration, “turn off” each neuron with a probability $1-\alpha$
- Also turn off inputs similarly
- Backpropagation is effectively performed only over the remaining network
- The effective network is different for different inputs
- Effectively learns a network that averages over all possible networks (Bagging)
- Dropout as a mechanism to increase pattern density
- Dropout forces the neurons to learn “rich” and redundant patterns
- E.g. without dropout, a noncompressive layer may just “clone” its input to its output
- Transferring the task of learning to the rest of the network upstream
Implementation
- The expected output of the neuron is
$$y_{i}^{(k)}=\alpha \sigma\left(\sum_{j} w_{ji}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)}\right)$$
- During test, push the $\alpha$ to all outgoing weights
$$\begin{aligned} z_{i}^{(k)} &=\sum_{j} w_{ji}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)} \\ &=\sum_{j} w_{ji}^{(k)} \alpha \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \\ &=\sum_{j}\left(\alpha w_{ji}^{(k)}\right) \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \end{aligned}$$
- So $W_{test} = \alpha W_{trained}$
- Instead of multiplying every output by $\alpha$, multiply all the trained weights by $\alpha$
- Alternate implementation
- During training, replace the activation of all neurons in the network by $\alpha^{-1} \sigma(\cdot)$
- Use $\sigma(\cdot)$ as the activation during testing, and do not modify the weights (this alternate scheme is sketched below)
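A sketch of the alternate ("inverted") implementation for one layer's outputs `y`, with `alpha` as the keep probability; the function name and layout are assumptions:

```python
import numpy as np

def dropout_forward(y, alpha, train=True):
    """Scale by 1/alpha during training so test-time weights need no modification."""
    if train:
        mask = np.random.rand(*y.shape) < alpha   # keep each neuron with probability alpha
        return y * mask / alpha                   # scale survivors so the expected output is unchanged
    return y                                      # test time: plain activations, unmodified weights
```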
More tricks
- Obtain training data
- Use appropriate representation for inputs and outputs
- Data Augmentation
- Choose network architecture
- More neurons need more data
- Deep is better, but harder to train
- Choose the appropriate divergence function
- Choose regularization
- Choose heuristics
- batch norm, dropout …
- Choose optimization algorithm
- Adagrad / Adam / SGD
- Perform a grid search for hyperparameters (learning rate, regularization parameter, …) on held-out data (a toy sketch follows this list)
- Train
- Evaluate periodically on validation data, for early stopping if required
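A toy sketch of the hyperparameter grid search on held-out data; `train_and_eval` is a hypothetical function that trains a model with the given settings and returns its validation loss:

```python
import itertools

def grid_search(train_and_eval, lrs=(1e-4, 1e-3, 1e-2), lams=(0.0, 1e-4, 1e-2)):
    best = None
    for lr, lam in itertools.product(lrs, lams):
        val_loss = train_and_eval(lr=lr, lam=lam)     # train, then evaluate on held-out data
        if best is None or val_loss < best[0]:
            best = (val_loss, lr, lam)
    return best   # (best validation loss, learning rate, regularization parameter)
```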
A simple and clear demonstration of 2 variables in a single network ↩︎
The perceptrons in the network are individually capable of sharp changes in output ↩︎