First, we need to know basic Gradient Descent:
$$\theta^* = \arg\min_\theta L(\theta)$$
($L$ is the loss function and $\theta$ is the parameter.)
We suppose that $\theta$ has two components $\{\theta_1, \theta_2\}$ and randomly start at
$$\theta^0 = \begin{bmatrix}\theta_1^0 \\ \theta_2^0\end{bmatrix}.$$
Then we can get
$$\begin{bmatrix}\theta_1^1 \\ \theta_2^1\end{bmatrix} = \begin{bmatrix}\theta_1^0 \\ \theta_2^0\end{bmatrix} - \eta \begin{bmatrix}\partial L(\theta_1^0)/\partial \theta_1 \\ \partial L(\theta_2^0)/\partial \theta_2\end{bmatrix}$$
We know that
$$\nabla L(\theta) = \begin{bmatrix}\partial L(\theta_1)/\partial \theta_1 \\ \partial L(\theta_2)/\partial \theta_2\end{bmatrix}$$
So we can get
$$\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$$
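As a concrete illustration, here is a minimal vanilla gradient descent loop in Python. The two-parameter quadratic loss, the starting point, and the learning rate are all made-up assumptions for this sketch, not from the notes above:

```python
import numpy as np

def L(theta):
    # Toy quadratic loss (an assumption for this sketch), minimized at theta = [3, -1].
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad_L(theta):
    # Gradient of the toy loss, computed by hand.
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

eta = 0.01                        # learning rate
theta = np.array([0.0, 0.0])      # starting point theta^0

for t in range(1000):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, L(theta))            # approaches [3, -1] with loss near 0
```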
Tuning your learning rate
The learning rate is $\eta$. If $\eta$ is too small, the procedure takes a long time, although we do reach the right $\theta$ in the end. If $\eta$ is too large, the procedure may never converge at all. When there are more than three parameters, the loss function cannot be visualized directly, so instead we plot the loss against the number of parameter updates to observe how training goes.
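One way to observe this is sketched below, assuming matplotlib is available and reusing the same kind of toy quadratic loss as before: record the loss after every update and plot it against the number of updates for several learning rates.

```python
import numpy as np
import matplotlib.pyplot as plt

def L(theta):
    # Same toy quadratic loss as in the sketch above (an assumption, not from the notes).
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad_L(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])

for eta in [0.001, 0.01, 0.09]:            # small, moderate, and almost-too-large learning rates
    theta = np.array([0.0, 0.0])
    losses = []
    for t in range(200):
        theta = theta - eta * grad_L(theta)
        losses.append(L(theta))
    plt.plot(losses, label=f"eta={eta}")   # for this toy loss, eta > 0.1 would make the loss blow up

plt.xlabel("number of updates")
plt.ylabel("loss")
plt.legend()
plt.show()
```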
Adaptive Learning Rates
A useful way to deal with this is to decay the learning rate over time:
$$\eta^t = \frac{\eta}{\sqrt{t+1}}$$
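In code, this decay is just a time-dependent learning rate inside the update loop. Here is a tiny sketch on a made-up one-dimensional loss $L(w) = w^2$ (the loss and initial values are assumptions):

```python
import math

# Decayed learning rate on a made-up 1-D loss L(w) = w**2, whose gradient is 2w.
eta = 0.3
w = 4.0
for t in range(100):
    g = 2.0 * w                      # dL/dw at the current w
    eta_t = eta / math.sqrt(t + 1)   # eta^t = eta / sqrt(t + 1): steps shrink over time
    w = w - eta_t * g
print(w)                             # close to the minimum at w = 0
```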
Adagrad
Adagrad divides the learning rate by the root mean square of all previous gradients. We can walk through a few steps to understand it.
$$\text{learning rate} = \frac{\eta^t}{\sigma^t}$$
If $t=0$:
$$\sigma^0 = \sqrt{(g^0)^2}$$
where
$$g^n = \nabla L(\theta^n)$$
If $t=1$:
$$\sigma^1 = \sqrt{\tfrac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$$
Therefore, in general:
$$\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$
$$\frac{\eta^t}{\sigma^t} = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$$
Finally, the Adagrad update is:
$$w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$
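Here is a minimal Adagrad sketch following the update above. The toy loss and gradient are assumptions for the sketch, and the small constant `1e-8` is a common guard against division by zero, not part of the formula:

```python
import numpy as np

def grad_L(w):
    # Gradient of a toy quadratic loss (an assumption for this sketch), minimized at [3, -1].
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

eta = 1.0
w = np.array([0.0, 0.0])
sum_sq_grad = np.zeros_like(w)          # accumulates (g^0)^2 + ... + (g^t)^2 per parameter

for t in range(2000):
    g = grad_L(w)
    sum_sq_grad += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum of squared past gradients) * g^t
    w = w - eta / (np.sqrt(sum_sq_grad) + 1e-8) * g

print(w)                                # moves toward [3, -1]
```

Note how each parameter gets its own accumulated sum, so a parameter with consistently large gradients automatically receives smaller steps.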
Larger gradient, larger step?
For a single parameter, it is true that the greater the derivative, the greater the step should be. But across multiple parameters, comparing derivatives to decide step sizes is wrong.
Second Derivative
Actually, the best step size is
$$\frac{|\text{first derivative}|}{\text{second derivative}}$$
So we cannot judge the step size by the first derivative alone.
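To see where this ratio comes from, consider the second-order Taylor approximation of the loss around the current point $x_0$ (a standard one-dimensional argument, stated here for completeness):
$$L(x) \approx L(x_0) + L'(x_0)(x - x_0) + \tfrac{1}{2}L''(x_0)(x - x_0)^2$$
Minimizing the right-hand side over $x$ (assuming $L''(x_0) > 0$) gives
$$x^* = x_0 - \frac{L'(x_0)}{L''(x_0)},$$
so the ideal step length is
$$|x^* - x_0| = \frac{|L'(x_0)|}{L''(x_0)}.$$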
Stochastic Gradient Descent
Loss function:
$$L = \sum_{n}\left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$$
Gradient Descent:
$$w^i = w^{i-1} - \eta \nabla L(w^{i-1})$$
In Stochastic Gradient Descent, we pick only one example $x^n$ at a time, compute the gradient of that single example's loss, and update immediately. Ordinary Gradient Descent has to look at all the examples before performing one update, whereas Stochastic Gradient Descent performs an update for every example as it traverses the data.
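Here is a minimal Stochastic Gradient Descent sketch for a linear model $y = b + wx$; the data, learning rate, and number of epochs are made-up assumptions:

```python
import numpy as np

# Made-up training data for y ≈ b + w * x (an assumption for this sketch): roughly y = 1 + 2x.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

eta = 0.01
w, b = 0.0, 0.0

for epoch in range(200):
    for x, y_hat in zip(xs, ys):          # SGD: update once per example
        err = y_hat - (b + w * x)
        # Gradients of the single-example loss (y_hat - (b + w*x))^2
        w = w - eta * (-2.0 * err * x)
        b = b - eta * (-2.0 * err)

print(w, b)                               # should end up near 2 and 1
```

Full-batch gradient descent would instead sum the gradients over all five examples before making one update per epoch.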
Feature Scaling
Features are the inputs, like $x_1$ or $x_2$. Feature scaling makes different features have the same range. If $x_2$'s range is too large, we should shrink it.
Why do we need feature scaling?
If the value ranges of $x_1$ and $x_2$ differ too much, the feature with the smaller value range will correspond to a larger parameter value range; in other words, the parameter ranges are not equal. Without an adaptive learning rate, gradient descent then becomes difficult. We want the contours of the loss function to be circles rather than ellipses: with circular contours, gradient descent points straight at the center (the minimum), while with elliptical contours it does not.
How to do feature scaling
Suppose there are many inputs, like $x^1, x^2, \dots, x^n$. Each input has several feature dimensions, like $x^1_1, x^1_2$. Suppose the range of the third dimension $x^n_3$ is too large. We compute its mean $m_3$ and standard deviation $\sigma_3$, and normalize:
$$x_3^n \leftarrow \frac{x_3^n - m_3}{\sigma_3}$$
After scaling, every dimension has mean 0 and variance 1.
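In code, standardizing each feature dimension looks like this (the data matrix is made up for the sketch; rows are examples and columns are feature dimensions):

```python
import numpy as np

# Made-up data: 4 examples (rows) with 3 feature dimensions (columns);
# the third dimension has a much larger range than the others.
X = np.array([
    [1.0, 0.2, 1000.0],
    [2.0, 0.1, 3000.0],
    [1.5, 0.4, 2000.0],
    [0.5, 0.3, 4000.0],
])

mean = X.mean(axis=0)             # m_i for each dimension i
std = X.std(axis=0)               # sigma_i for each dimension i
X_scaled = (X - mean) / std       # x_i^n <- (x_i^n - m_i) / sigma_i

print(X_scaled.mean(axis=0))      # approximately 0 in every dimension
print(X_scaled.std(axis=0))       # approximately 1 in every dimension
```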
Important!
Updating the parameters may not always reduce the loss. Moreover, gradient descent can get stuck not only at local minima but also at saddle points, where the derivative is likewise 0. In practice, we often stop gradient descent when the loss decreases very slowly, believing that we are close to a local minimum; in fact, the loss may still be higher than we think.