13.4 Summary
Finally, we’ve done it. Until this chapter, we haven’t been that close to machine learning, but now, we are right at the heart of it. Gradient descent is the number one algorithm to train neural networks. Yes, even state-of-the-art ones.
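To make the idea concrete, here is a minimal gradient descent sketch in Python. It minimizes the hypothetical function f(x) = (x − 3)² using its derivative; this is an illustration only, not the training loop of an actual neural network.

```python
# A minimal gradient descent sketch (illustrative only).
# It minimizes the hypothetical function f(x) = (x - 3)**2 using its derivative.

def grad(x):
    """Derivative of f(x) = (x - 3)**2."""
    return 2 * (x - 3)

x = 0.0               # starting point
learning_rate = 0.1   # step size

for _ in range(100):
    x -= learning_rate * grad(x)   # take a step against the gradient

print(x)  # close to 3, the minimizer of f
```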
It all starts with calculus. To reach the heights of gradient descent, we studied the relationship between monotonicity, local extrema, and the derivative. The pattern is simple: if f′(a) > 0, then f is increasing around a, but if f′(a) < 0, then f is decreasing around a. Speaking in terms of physics, if the velocity is positive, the object is moving away, but if it is negative, the object is coming closer.
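As a quick illustration (this example is not from the chapter), the snippet below checks the rule on f(x) = x², whose derivative f′(x) = 2x is negative to the left of 0 and positive to the right.

```python
# Monotonicity and the sign of the derivative, checked on f(x) = x**2
# (a hypothetical example for illustration).

def f(x):
    return x**2

def f_prime(x):
    return 2 * x   # derivative of x**2

# f'(-1) = -2 < 0, so f is decreasing around x = -1: f(-1.1) > f(-0.9)
print(f_prime(-1.0), f(-1.1), f(-0.9))

# f'(1) = 2 > 0, so f is increasing around x = 1: f(0.9) < f(1.1)
print(f_prime(1.0), f(0.9), f(1.1))
```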
Based on this observation, we can deduce sufficient conditions for local minima and maxima:
- if f′(a) = 0 and f′′(a) > 0, then a is a local minimum,
- but if f′(a) = 0 and f′′(a) < 0, then a is a local maximum.
So, finding the local extrema of a differentiable function comes down to solving the equation f′(x) = 0 and checking the sign of the second derivative at the solutions.
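For instance, the following sketch applies the second derivative test to the hypothetical function f(x) = x³ − 3x, whose critical points solve f′(x) = 3x² − 3 = 0, namely x = −1 and x = 1; the function and its derivatives here are assumptions chosen for illustration.

```python
# The second derivative test on f(x) = x**3 - 3*x (illustrative example).
# Its critical points are x = -1 and x = 1.

def f_second(x):
    """Second derivative of f(x) = x**3 - 3*x."""
    return 6 * x

for a in (-1.0, 1.0):
    if f_second(a) > 0:
        kind = "local minimum"
    elif f_second(a) < 0:
        kind = "local maximum"
    else:
        kind = "inconclusive"   # the test says nothing when f''(a) = 0
    print(a, kind)   # -1.0 is a local maximum, 1.0 is a local minimum
```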