First, we need to know basic Gradient Descent:
$$\theta^* = \arg\min_\theta L(\theta)$$
($L$ is the loss function and $\theta$ is the parameter.)
We suppose that $\theta$ has two components $\{\theta_1, \theta_2\}$ and randomly start at
$$\theta^0 = \begin{bmatrix}\theta_1^0 \\ \theta_2^0\end{bmatrix}.$$
Then we can get
$$\begin{bmatrix}\theta_1^1 \\ \theta_2^1\end{bmatrix} = \begin{bmatrix}\theta_1^0 \\ \theta_2^0\end{bmatrix} - \eta \begin{bmatrix}\partial L(\theta_1^0)/\partial \theta_1 \\ \partial L(\theta_2^0)/\partial \theta_2\end{bmatrix}$$
We know that
$$\nabla L(\theta) = \begin{bmatrix}\partial L(\theta_1)/\partial \theta_1 \\ \partial L(\theta_2)/\partial \theta_2\end{bmatrix}$$
So we can get
$$\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$$
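As a concrete illustration, here is a minimal vanilla gradient descent loop in Python. The two-parameter quadratic loss, the starting point, and the learning rate are all made-up assumptions for this sketch, not from the notes above:

```python
import numpy as np

def L(theta):
    # Toy quadratic loss (an assumption for this sketch), minimized at theta = [3, -1].
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad_L(theta):
    # Gradient of the toy loss, computed by hand.
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

eta = 0.01                        # learning rate
theta = np.array([0.0, 0.0])      # starting point theta^0

for t in range(1000):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, L(theta))            # approaches [3, -1] with loss near 0
```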
Tuning your learning rate
The learning rate is $\eta$. If $\eta$ is too small, the procedure takes a long time, although we do reach the right $\theta$ in the end. If $\eta$ is too large, the procedure may never converge at all. When there are more than three parameters, the loss function cannot be visualized directly, so instead we plot the loss against the number of parameter updates to observe how training goes.
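One way to observe this is sketched below, assuming matplotlib is available and reusing the same kind of toy quadratic loss as before: record the loss after every update and plot it against the number of updates for several learning rates.

```python
import numpy as np
import matplotlib.pyplot as plt

def L(theta):
    # Same toy quadratic loss as in the sketch above (an assumption, not from the notes).
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad_L(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

for eta in [0.001, 0.01, 0.09]:            # small, moderate, and almost-too-large learning rates
    theta = np.array([0.0, 0.0])
    losses = []
    for t in range(200):
        theta = theta - eta * grad_L(theta)
        losses.append(L(theta))
    plt.plot(losses, label=f"eta={eta}")   # for this toy loss, eta > 0.1 would make the loss blow up

plt.xlabel("number of updates")
plt.ylabel("loss")
plt.legend()
plt.show()
```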
Adaptive Learning Rates
A useful way to deal with this is to decay the learning rate over time:
$$\eta^t = \frac{\eta}{\sqrt{t+1}}$$
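In code, this decay is just a time-dependent learning rate inside the update loop. Here is a tiny sketch on a made-up one-dimensional loss $L(w) = w^2$ (the loss and initial values are assumptions):

```python
import math

# Decayed learning rate on a made-up 1-D loss L(w) = w**2, whose gradient is 2w.
eta = 0.3
w = 4.0
for t in range(100):
    g = 2.0 * w                      # dL/dw at the current w
    eta_t = eta / math.sqrt(t + 1)   # eta^t = eta / sqrt(t + 1): steps shrink over time
    w = w - eta_t * g
print(w)                             # close to the minimum at w = 0
```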
Adagrad
Adagrad divides the learning rate by the root mean square of all previous gradients. We can walk through a few steps to understand it.
$$\text{learning rate} = \frac{\eta^t}{\sigma^t}$$
If $t=0$:
$$\sigma^0 = \sqrt{(g^0)^2}$$
where
$$g^n = \nabla L(\theta^n)$$
If $t=1$:
$$\sigma^1 = \sqrt{\tfrac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$$
Therefore, in general:
$$\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$
$$\frac{\eta^t}{\sigma^t} = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$$
Finally, the Adagrad update is:
$$w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$
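Here is a minimal Adagrad sketch following the update above. The toy loss and gradient are assumptions for the sketch, and the small constant `1e-8` is a common guard against division by zero, not part of the formula:

```python
import numpy as np

def grad_L(w):
    # Gradient of a toy quadratic loss (an assumption for this sketch), minimized at [3, -1].
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

eta = 1.0
w = np.array([0.0, 0.0])
sum_sq_grad = np.zeros_like(w)          # accumulates (g^0)^2 + ... + (g^t)^2 per parameter

for t in range(2000):
    g = grad_L(w)
    sum_sq_grad += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum of squared past gradients) * g^t
    w = w - eta / (np.sqrt(sum_sq_grad) + 1e-8) * g

print(w)                                # moves toward [3, -1]
```

Note how each parameter gets its own accumulated sum, so a parameter with consistently large gradients automatically receives smaller steps.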
Larger gradient, larger step?
For a single parameter, it is true that the greater the derivative, the greater the step should be. But across multiple parameters, comparing derivatives to decide step sizes is wrong.
Second Derivative
Actually, the best step size is
$$\frac{|\text{first derivative}|}{\text{second derivative}}$$
So we cannot judge the step size by the first derivative alone.
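To see where this ratio comes from, consider the second-order Taylor approximation of the loss around the current point $x_0$ (a standard one-dimensional argument, stated here for completeness):
$$L(x) \approx L(x_0) + L'(x_0)(x - x_0) + \tfrac{1}{2}L''(x_0)(x - x_0)^2$$
Minimizing the right-hand side over $x$ (assuming $L''(x_0) > 0$) gives
$$x^* = x_0 - \frac{L'(x_0)}{L''(x_0)},$$
so the ideal step length is
$$|x^* - x_0| = \frac{|L'(x_0)|}{L''(x_0)}.$$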
Stochastic Gradient Descent
Loss function:
$$L = \sum_{n}\left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$$
Gradient Descent:
$$w^i = w^{i-1} - \eta \nabla L(w^{i-1})$$
In Stochastic Gradient Descent, we pick only one example $x^n$ at a time, compute the gradient of that single example's loss, and update immediately. Ordinary Gradient Descent has to look at all the examples before performing one update, whereas Stochastic Gradient Descent performs an update for every example as it traverses the data.
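Here is a minimal Stochastic Gradient Descent sketch for a linear model $y = b + wx$; the data, learning rate, and number of epochs are made-up assumptions:

```python
import numpy as np

# Made-up training data for y ≈ b + w * x (an assumption for this sketch): roughly y = 1 + 2x.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

eta = 0.01
w, b = 0.0, 0.0

for epoch in range(200):
    for x, y_hat in zip(xs, ys):          # SGD: update once per example
        err = y_hat - (b + w * x)
        # Gradients of the single-example loss (y_hat - (b + w*x))^2
        w = w - eta * (-2.0 * err * x)
        b = b - eta * (-2.0 * err)

print(w, b)                               # should end up near 2 and 1
```

Full-batch gradient descent would instead sum the gradients over all five examples before making one update per epoch.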
Feature Scaling
Features are the inputs, like $x_1$ or $x_2$. Feature scaling makes different features have the same range. If $x_2$'s range is too large, we should shrink it.
Why do we need feature scaling?
If the value ranges of $x_1$ and $x_2$ differ too much, the feature with the smaller value range will correspond to a larger parameter value range; in other words, the parameter ranges are not equal. Without an adaptive learning rate, gradient descent then becomes difficult. We want the contours of the loss function to be circles rather than ellipses: with circular contours, gradient descent points straight at the center (the minimum), while with elliptical contours it does not.
How to do feature scaling
Suppose there are many inputs, like $x^1, x^2, \dots, x^n$. Each input has several feature dimensions, like $x^1_1, x^1_2$. Suppose the range of the third dimension $x^n_3$ is too large. We compute its mean $m_3$ and standard deviation $\sigma_3$, and normalize:
$$x_3^n \leftarrow \frac{x_3^n - m_3}{\sigma_3}$$
After scaling, every dimension has mean 0 and variance 1.
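In code, standardizing each feature dimension looks like this (the data matrix is made up for the sketch; rows are examples and columns are feature dimensions):

```python
import numpy as np

# Made-up data: 4 examples (rows) with 3 feature dimensions (columns);
# the third dimension has a much larger range than the others.
X = np.array([
    [1.0, 0.2, 1000.0],
    [2.0, 0.1, 3000.0],
    [1.5, 0.4, 2000.0],
    [0.5, 0.3, 4000.0],
])

mean = X.mean(axis=0)             # m_i for each dimension i
std = X.std(axis=0)               # sigma_i for each dimension i
X_scaled = (X - mean) / std       # x_i^n <- (x_i^n - m_i) / sigma_i

print(X_scaled.mean(axis=0))      # approximately 0 in every dimension
print(X_scaled.std(axis=0))       # approximately 1 in every dimension
```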
Important!
Updating the parameters may not always reduce the loss. Moreover, gradient descent can get stuck not only at local minima but also at saddle points, where the derivative is likewise 0. In practice, we often stop gradient descent when the loss decreases very slowly, believing that we are close to a local minimum; in fact, the loss may still be higher than we think.