Machine Learning: Gradient Descent

This post covers the basics of gradient descent, including the parameter update rule and the loss function. It stresses the importance of the learning rate: a small one can make convergence slow, while a large one may fail to converge at all. It then shows how to decay the learning rate dynamically via $\eta^t = \eta/\sqrt{t+1}$, and introduces the Adagrad algorithm, which scales the learning rate by the root mean square of past gradients. The post also explains why and how to do feature scaling, and touches on stochastic gradient descent and the role of second derivatives in optimization. Finally, it discusses stopping gradient descent when the loss decreases slowly, noting that this does not always mean a local minimum has been reached.

First, we need the basic gradient descent setup:
$$\theta^* = \arg\min_\theta L(\theta)$$
($L$ is the loss function, and $\theta$ is the parameter.)
Suppose $\theta$ has two variables $\{\theta_1, \theta_2\}$. Start randomly at $\theta^0 = \begin{bmatrix} \theta^0_1 \\ \theta^0_2 \end{bmatrix}$, and we get
$$\begin{bmatrix} \theta^1_1 \\ \theta^1_2 \end{bmatrix} = \begin{bmatrix} \theta^0_1 \\ \theta^0_2 \end{bmatrix} - \eta \begin{bmatrix} \partial L(\theta^0_1)/\partial \theta_1 \\ \partial L(\theta^0_2)/\partial \theta_2 \end{bmatrix}$$
We know that
$$\nabla L(\theta) = \begin{bmatrix} \partial L(\theta_1)/\partial \theta_1 \\ \partial L(\theta_2)/\partial \theta_2 \end{bmatrix}$$
So we can write
$$\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$$
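A minimal sketch of this update rule in Python. The toy loss $L(\theta) = \theta_1^2 + \theta_2^2$ is my own choice (not from the post), picked because its gradient is known in closed form:

```python
# Toy loss L(theta) = theta1^2 + theta2^2; its gradient is (2*theta1, 2*theta2).
def grad_L(theta):
    return [2 * theta[0], 2 * theta[1]]

def gradient_descent(theta0, eta=0.1, steps=100):
    """Iterate theta^{t+1} = theta^t - eta * grad L(theta^t)."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad_L(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

theta = gradient_descent([3.0, -4.0])
# theta ends up close to the minimizer (0, 0)
```

Each iteration subtracts the scaled gradient from every component at once, exactly as in the vector form above.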

Tuning your learning rate

The learning rate is $\eta$. If $\eta$ is small, the procedure may take a long time, but it reaches the right $\theta$ in the end. If $\eta$ is large, the procedure may never produce a result at all. And when there are more than 3 parameters, the loss function cannot be visualized directly; instead we can plot the loss against the number of parameter updates to monitor progress.
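Both failure modes are easy to see on a one-variable toy loss $L(\theta) = \theta^2$ (my own example): a tiny $\eta$ crawls toward the minimum, while an $\eta$ that is too large makes the iterates blow up:

```python
def run(eta, steps=50, theta=1.0):
    # Update rule: theta <- theta - eta * L'(theta), with L(theta) = theta^2.
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta

small = run(0.01)   # converges, but slowly: theta is still far from 0
large = run(1.5)    # |1 - 2*eta| = 2 > 1, so each step doubles the error
```

After each update the error is multiplied by $(1 - 2\eta)$, so convergence requires $|1 - 2\eta| < 1$ on this loss.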

Adaptive Learning Rates

A useful way to deal with this is to decay the learning rate over time:
$$\eta^t = \eta/\sqrt{t+1}$$

Adagrad

Adagrad divides the learning rate by the root mean square of past gradients. Working through small cases makes it clear:
$$LR = \eta^t/\sigma^t$$
If $t=0$:
$$\sigma^0 = \sqrt{(g^0)^2}$$
where
$$g^n = \nabla L(\theta^n)$$
If $t=1$:
$$\sigma^1 = \sqrt{\frac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$$
Therefore we can get:
$$\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$
Since $\eta^t = \eta/\sqrt{t+1}$, the $\sqrt{t+1}$ factors cancel:
$$\frac{\eta^t}{\sigma^t} = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$$
Finally we get the Adagrad gradient descent update:
$$w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$
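The final update can be sketched for a single weight. The loss $L(w) = (w-1)^2$ and its gradient are my own illustration, not from the post:

```python
import math

def adagrad(w=0.0, eta=0.5, steps=200):
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = 2 * (w - 1)           # gradient of L(w) = (w - 1)^2
        grad_sq_sum += g ** 2     # accumulate squared past gradients
        # w^{t+1} = w^t - eta / sqrt(sum of squared gradients) * g^t
        w -= eta / math.sqrt(grad_sq_sum) * g
    return w

w = adagrad()
# w approaches the minimizer 1.0
```

Note how the accumulated sum only grows, so the effective learning rate shrinks automatically as training proceeds.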

Larger Gradient, Larger Step?

For a single parameter, it is correct that the larger the derivative, the larger the step should be. But across multiple parameters, comparing derivative magnitudes is misleading, because each parameter's loss curve can have a different curvature.

Second Derivative

Actually, the best step size is
$$\frac{|\text{first derivative}|}{\text{second derivative}}$$
So we cannot judge step sizes by the first derivative alone.
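Where this ratio comes from, for a quadratic loss (a standard one-line derivation, not spelled out in the post): for $L(w) = a(w-b)^2 + c$ the minimum is at $b$, and

$$L'(w_0) = 2a(w_0 - b), \qquad L''(w_0) = 2a$$

so the distance from the current point $w_0$ to the minimum is

$$|w_0 - b| = \frac{|L'(w_0)|}{L''(w_0)}$$

A large first derivative may simply reflect a large curvature $2a$, not a large distance to the minimum, which is why comparing raw gradients across parameters fails.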

Stochastic Gradient Descent

Loss function:
$$L = \sum_{n}\left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$$
Gradient descent:
$$w^i = w^{i-1} - \eta \nabla L(w^{i-1})$$
In stochastic gradient descent, we pick just one example $x^n$ at a time. Ordinary gradient descent has to look at all the examples before making a single update, whereas stochastic gradient descent traverses the examples and performs an update for each one.
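A sketch contrasting the two update schemes on a one-weight linear model $y = w x$; the tiny dataset below is invented for illustration:

```python
# Tiny 1-D dataset (invented): y = 2x, so the true weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_gd(w=0.0, eta=0.01, epochs=100):
    # One update per pass: gradient is summed over ALL examples first.
    for _ in range(epochs):
        grad = sum(-2 * x * (y - w * x) for x, y in data)
        w -= eta * grad
    return w

def sgd(w=0.0, eta=0.01, epochs=100):
    # One update PER example: gradient of a single example's loss.
    for _ in range(epochs):
        for x, y in data:
            w -= eta * (-2 * x * (y - w * x))
    return w
```

On this tiny problem both recover $w \approx 2$; the difference is that SGD makes three noisy updates per epoch while batch gradient descent makes one exact update.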

Feature Scaling

A feature is an input, like $x_1$ or $x_2$. Feature scaling makes different features share the same range. If $x_2$'s range is too large, we should shrink it.

Why we need feature scaling?

If the value ranges of $x_1$ and $x_2$ differ too much, the feature with the smaller value range ends up corresponding to a larger parameter value range; that is to say, the parameter scales are unequal. Without an adaptive learning rate, gradient descent then becomes difficult. We need the loss function's contours to be circles, not ellipses: on circular contours the gradient points directly at the center (the minimum), while on elliptical contours it does not.

How to feature scaling

Suppose there are many inputs, like $x^1, x^2, \dots, x^n$, and each input has several feature dimensions, like $x^1_1, x^1_2$. For each dimension $i$, we compute the mean $m_i$ and the standard deviation $\sigma_i$, then standardize:
$$x_i^n = \frac{x_i^n - m_i}{\sigma_i}$$
After this, the means of all dimensions are 0, and the variances are all 1.
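The standardization step, sketched for a single feature column (the sample values are invented):

```python
import math

def standardize(column):
    """Shift a feature column to mean 0 and standard deviation 1."""
    m = sum(column) / len(column)                                  # mean m_i
    sigma = math.sqrt(sum((x - m) ** 2 for x in column) / len(column))  # std sigma_i
    return [(x - m) / sigma for x in column]

scaled = standardize([10.0, 20.0, 30.0, 40.0])
# scaled now has mean 0 and variance 1
```

In practice this is applied column by column across the whole dataset, so every dimension ends up on the same scale.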

Important!

Updating the parameters does not always reduce the loss. And gradient descent may get stuck not only at local minima but also at saddle points, where the derivative is likewise 0. In practice, we often choose to stop gradient descent when the loss function decreases slowly, thinking we must be close to a local minimum; in fact, the point we stop at may be higher than we think.
